
Creating the FAP4 publickey encryption chipIn the early 1980s, when publickey encryption was starting to be important, one of the biggest obstacles to its widespread adoption was the amount of arithmetic it required. This may sound strange, given that the purpose of computers is to compute, but this was serious arithmetic: not just the 5digit numbers that microprocessors could handle then, or the 10digit numbers they can handle now. Calculating or checking a single digital signature required thousands of calculations on huge numbers: 150 or even 300 digits long. Even on today's computers that adds up to around 3,000,000 multiplications: in the 1980s, it was not unusual for a digital signature operation to take 20 seconds, far too long for it to be usable. The obvious thing to do was to design a specialpurpose chip to do the encryption. Such a chip could do 300digit arithmetic directly, making it between a hundred and a thousand times faster than the microprocessor. I invented such a chip, using some new techniques to increase its speed, and this is its story. M.J. Kochanski, "Developing an RSA Chip", in Advances in Cryptology: Proceedings of CRYPTO 85, SpringerVerlag, Berlin (1985), 3540164634. The only documentation of the FAP4 chip so far has been a conference paper. This article could have been another conference paper, full of formulae and references to the existing technical literature. I have chosen, however, to write it in a form that can be read and appreciated by nonspecialists. Mathematicians and computer scientists will find a lot that they know already; but I think the story is interesting enough to be worth telling so that real people can follow it. If you really think you know everything you need to know about longinteger arithmetic, you can skip the explanations. Modular multiplicationThe operation at the core of all the major publickey cryptosystems (and related things such as keyexchange protocols) is modular multiplication. Modular multiplication is simple to describe: to multiply A by B modulo R is equivalent to calculating A×B, dividing the result by R, throwing away the quotient, and keeping the remainder. For example, 10×10=9 (mod 13). Computationally, the nice thing about arithmetic modulo R (you can define modular addition and subtraction similarly) is that because you are forever dividing by R and taking the remainder, no number that you handle ever needs to be bigger than R. So if we design a chip that can handle numbers up to R, it will be able to deal with any arithmetic that we throw at it. Since R in a practical cryptosystem is between 512 and 1024 bits long (about 154 to 308 decimal digits), we need to have a chip that handles numbers of that size: let's call it 300 digits for convenience. How do you design a chip for 300digit modular multiplications? The first step is to deal with multiplication on its own: we can look at modulo reduction later. How to multiplyThe method is called long multiplication, and you probably learned it in primary school when you were six or seven. Here, to remind you, is a simple example. To calculate 789098× 123456:
You may have noticed that I've assumed that you've been given the singledigit multiples of 123456 (7×123456 and so on) "on a plate". There is a reason for this, which I'll come to later. Now the thing to notice about this calculation is that it involves about 6×6=36 additions. This is a general rule: if you multiply 300digit numbers then you'll end up doing about 300×300=90,000 additions. The challenge is to reduce this number. Before going any further I'd like to rewrite the calculation shown above.
This looks rather longwinded, but only the lines shown in yellow actually do any real calculations: the multiplications by ten simply move the numbers left by one step, which isn't really arithmetic at all. This gets much nearer the way a chip will work, and the other good thing about it is that only the last seven columns in the table are involved in additions: when this becomes a chip, this will save a lot of circuitry. You may be wondering why all my arithmetic uses decimal digits when everyone knows that computer chips work in binary and use just 0s and 1s. Here's why. Now, about those 6×6 (or 300×300) additions: what I should have said is that these were singledigit additions. If the chip contained only a single circuit for addition, then the whole multiplication would indeed take 90,000 times as long as a single addition. But adders are tiny things, and it is quite reasonable to put 300 of them on a chip. In that case, they could all work simultaneously, add two 300digit numbers in one step instead of 300 separate 1digit steps, and thus reduce the time taken to multiply by a factor of 300. This is an example of a "timespace tradeoff", which is a very common phenomenon in computing. Each step involves more circuitry and becomes a bit more complex, but far fewer steps are required. So the remaining question is: how fast can we add? Can 300 adders work together to add two 300digit numbers in the same time that a single adder can add two single digits? The answer is: not quite. In fact, unless we do something about it, the 300 adders will hardly save us any time at all. Why adding isn't as fast as you'd expectLet's go back the first step in our sample multiplication, 8641920 + 987648, and do it out loud as we would at school. We go from right to left. 0+8=8, that's easy; 2+4=6, that's easy too; 9+6=15, so we write down 5 and carry 1; 1+7+1=9; 4+8=12, so we write down 2 and carry 1; 6+9+1=16, so we write down 6 and carry 1; 8+0+1=9. It is those carries that cause the trouble. If it weren't for carries, we could build seven separate adders (let's call them Alice, Bob, Carol and so on) and each one could process its digits at the same time as all the others. But we do have carries. Alice can't be certain of her result until she knows whether Bob is going to give her a carry. How soon can she discover this? Let's look at Bob. Bob doesn't yet know if Carol is going to give him a carry or not. If the sum of the digits that he has been given is less than 9, this doesn't matter because even if Carol did come up with a carry, the result would still be less than 10 and there'd be no need to pass a carry to Alice: thus Bob can report "no carry" to Alice without having to wait. If the sum of Bob's digits is 10 or more, there's no problem either, because Bob knows at once that he has to report "carry" to Alice and he can do this without waiting for Carol. But if Bob's sum is 9 then the he can't tell whether he is going to have a carry or not: he has to wait until Carol has decided whether to give him a carry. "No carry" from Carol means that Bob's result can stay at 9 and he can report "no carry" to Alice; "carry" from Carol means that Bob's result becomes 0 and he has to report "carry" to Alice. Either way, Bob can't report to Alice before hearing from Carol. It gets worse. If the digits that Carol was given add up to 9, then Carol can't report to Bob without hearing from Diana. And you never know, Diana may have to wait to hear from Ed, and Ed may have to wait for Frances, and so on right through all the adders. This means that the simplest way of adding faster doesn't really work. Most of the time we get the result pretty quickly; but there are times when we have to wait for a carry to propagate itself step by step through a long sequence of adders. We have to slow down the entire system to allow for this, and it ends up very little faster than the original simple singledigit adder. An improvement: lookahead carryThere is a known method for dealing with this, called lookahead carry. It is clever but it's also ugly, and for really large numbers it's not enough help: with numbers of the size we're talking about, addition will still end up at least 20 times slower than if we didn't have to use carries at all. The solution: carrysave integersThe basic idea of carrysave integers is a very simple one. All we do is invent a new digit with the value 10. Let's call it X. So, for example, 4X is just another way of writing 50, and 19X is the same as 1X0 which is the same as 200. What benefit does this bring? Before we invented X, there was a problem with the way an adder had to behave when the sum of its digits was 9. The rule was "If the adder to your right doesn't give you a carry, your result is 9 and you don't need to pass a carry to your left; if your righthand neighbour did give you a carry, your result is 0 and you need to pass a carry to your left". In other words, "You can't decide whether to pass a carry to your left until you know whether you've received one". Now that we have X, we can change the rule for when the sum of digits is 9. Now it reads "If you don't receive a carry, your result is 9 and you don't pass on a carry; if you do receive a carry, your result is X and you don't pass on a carry either". In other words, "Never pass on a carry". So now Bob's behaviour is quick and simple. If the total of his digits is 9 or less, he doesn't pass a carry to Alice; if the total is 10 or more, he does. If Carol then passes a carry to Bob, Bob adds 1 to the result he has calculated. 0 becomes 1, 1 becomes 2, ... 8 becomes 9, and 9 becomes X. There is never any need to worry Alice about anything that Carol may have said. The problem of propagating carries has disappeared. You can avoid most of this conversion step by doing the multiplication "backwards", starting at the righthand end of the multiplier instead of starting on the left. I won't go into this further because it doesn't work well with modular multiplication, but here, for completeness, is a surreal modular multiplication technique that does work from right to left. Of course, X is not really an official digit, so at the end of the multiplication you'll want to have the result in conventional numbers without any Xs in it. The easiest thing is to pause for a moment at the end of each multiplication and let carries propagate from one adder to the next. This is exactly what we were doing before carrysave numbers came in, except that then we were having to do it on every single addition and now we have to do it only once at the very end of a multiplication. In my chip design, this final "convert to a conventional integer" stage added only 1% to the time taken by a multiplication. So is everything coming up roses? For a multiplication chip, yes. But what we really want to do is modular multiplication, and so far I've completely ignored the fact that we're doing arithmetic modulo R. Modular multiplication at no costArithmetic modulo R means that multiples of R don't count, because we're going to throw them away in the end anyway. The formal definition of A×B (mod R) may have said "multiply, then divide", but there is really no need to do it like that. In fact, doing it that way is a waste of time and space. Let's go back to our old multiplication sum and do it modulo 876543. This means that we are allowed to discard multiples of 876543 whenever we like, and we take full advantage of the permission.
If you have a suspicious mind, you can check that this equals the remainder after dividing the original result of the multiplication (97418882688) by 876543. Basically all that has happened is that we've subtracted a multiple of the modulus R (876543 in this case) at every stage in the multiplication. In each case, we've subtracted the biggest multiple of R that we could. Apart from giving us the result we want, this has also kept the numbers shorter. In the old version, we needed to accumulate 11digit numbers, but now we only need 7 digits for temporary results and 6 digits for the final ones: not such a dramatic difference here, but in the fullsized case this means 300 or 301 digits instead of 600 of them. Can adders subtract? Yes. One even better thing: when we come to make all this into a chip, it turns out that with a little extra circuitry (essentially, doubling the number of adders) we can do the subtraction at the same time as the addition that follows it, so that making the multiplication modular ends up costing us no time at all. The cost of the nocost solutionIs this the perfect solution? Unfortunately not. There is a hidden assumption in this algorithm. When we're deciding what multiple of the modulus to subtract, we need to know the value of the number we're subtracting it from. That's obvious. But the key property of carrysave integers (and the thing that makes them fast) is that you don't know their exact value, not unless you pause and let all those carries trickle through. The whole advantage that we got from carrysave integers was precisely the fact that we didn't need to wait to find out their value, we could just carry on calculating with them anyway and wait till the end to find out the result. If we now have to pause and check the value at each stage, in order to find out what to subtract, we lose all the benefit and end up with a very slow chip. This was the situation when Ernie Brickell in the USA and I in the UK came up, more or less simultaneously, with an idiotic idea. It's so idiotic that it deserves a section to itself. If you can't be certain, guess!The idea was this: if you can't be certain (because of carrysave integers) what multiple of R to subtract, make a guess. If you guessed wrong, you may need to do some extra subtracting later or you may need to add back something that you shouldn't have subtracted in the first place. What makes things worse is that the error doubles in each succeeding step (if you look at my examples, you'd expect the error to be multiplied by 10, but this is because I'm using decimal arithmetic because it's more understandable, and a chip would use binary instead). A chip whose errors double at every tick of its clock needs careful control if it isn't to get out of hand, and at first sight there's no real reason to suppose that the errors will be controllable at all. It took quite a lot of simulation and some careful finetuning of the rules before I could be sure that the guess would always be good enough to keep the errors under control. Ernest F. Brickell, "A Fast Modular Multiplication Algorithm with Applications to Two Key Cryptography", in Advances in Cryptology: Proceedings of CRYPTO '82, Plenum, New York (1983), pp. 5160. Brickell's algorithm is documented in his paper, which was presented at Crypto '82 and published the following year. It uses a close relative of carrysave integers, called delayedcarry integers. I didn't read actually read this paper at the time: I was busy turning my own algorithm into a chip, FAP4, which became the world's first commercially available publickey encryption chip. My algorithm is described here for the first time. It is different from Brickell's and simpler: with some chip architectures it may take only half the silicon of Brickell's design. The basic structureFull technical details in: "A New Method of Serial Modular Multiplication". The backbone of the FAP4 design is a sequence of cells, each of which handles one digit. To handle 1024bit numbers, connect 1024 cells together; to handle 512bit numbers, connect 512 cells. Each cell of a chip for calculating A×B (mod R) contains
The operation of the chip is governed by a clock. At the start of a clock cycle, each cell takes the operands from ACC and performs the necessary additions: for instance, it might calculate ACC+BR. When enough time has passed for the additions to be complete, the clock cycle ends. The results of the additions are stored in ACC, replacing what was stored there previously. The results are not stored in the same cell they came from but in the next cell to the left, which means that we have a builtin doubling with every clock tick. For a 1024bit array, the clock ticks 1024 times to do a multiplication; for a 512bit array, it ticks 512 times. In addition to this sequence of cells, we need a controller to control them all. This decides what the next calculation should be: whether B needs to be added, whether R needs to be subtracted. It sends out signals to all the cells telling them what to do. The available technologiesThere were three basic ways of building a chip to your own design, although some variants exist. Chipsmaking is very much like colour printing: the equivalent of the printing plates for each colour are "masks" that determine where certain processes should take place: etching, diffusion, metal layers, etc. What we think of as transistors and logic cells are simply what happens when particular kinds of silicon are adjacent and connected by metal in particular ways. Full custom design means that you design the exact layout of every layer yourself. This gives perfect control of everything, which makes for the most compact possible layout, but it does also require a great deal of expertise: you need to know how wide a strip of metal needs to be to carry a particular signal, how long that signal takes to propagate, how big a transistor should be to behave the way you want and how far away it has to be from other transistors. Full custom design is also expensive because every single mask has to be designed and created for your design. Cellbased design is a packaged version of full custom design. The manufacturers of the chip work out what pattern of masks gives the best adder, the best multiplexer, the best flipflop, the best logic gate, and you work with these elements instead of with the raw physics: the manufacturer's specification tells you how big each element is and how fast it operates. This makes the design easier and less of a risk, but it is still expensive because once again all the masks have to be created specially for your chip. Gate arrays are premanufactured in bulk and then finished off for each customer. In its premanufactured form, a gate array resembles a "sea" of transistors, each one complete in itself but none of them connected to any of the others. To create logic elements these uncommitted transistors need to be connected together with "wires" of metal: in practice, these are layers of metal deposited onto the chip under the control of a mask. Once again, you get a recipe book that tells you what logic elements are available, how much space they take and how fast they are. The big advantage of gate arrays is that only two (or three) masks have to be designed and built for your project. All the other masks, and all the other manufacturing processes, can be shared among all the projects that use this particular type of gate array. Not only does this make gate arrays cheaper, but it also makes them faster to build, since the chips can be partmanufactured in advance long before your own project even starts. In recent years fieldprogrammable gate arrays (FPGAs) have also appeared. These take the gate array concept one step further by dispensing with customdesigned metal layers altogether: everything is done by programming instead. Each cell includes a few bits of storage, and those bits tell the cell how to behave: effectively, what logic element it should be. The cells are more complex, which means that you can get fewer of them on a chip. The chips themselves are inherently more expensive, but on the other hand you can buy them in ones and twos instead of in thousands and you can program them in seconds rather than waiting for weeks for them to arrive from the manufacturer. The purpose of building FAP4 was to demonstrate that the design worked and to have a product that was usefully faster than software, so we chose to use gate arrays. How many chips?It would have been nice to fit a whole 512bit or even 1024bit array onto a single chip. In 1984 this might conceivably have been possible if we had gone for a fullcustom approach, but gate arrays fell far short of the capacity we needed. The gate arrays we chose (from Fujitsu) came in various sizes, but the largest reasonable size could only fit about 40 copies of our backbone cell. For comparison: looking at an arbitrarily chosen contemporary line of gate arrays (NEC's uPD658xx, first made available in 1993 but still sold in 2003), a single unit of the FAP4 backbone occupies between 55 and 80 cells. The chips are available in a range of sizes, and translating them into FAP4 terms, their capacity ranges from 100 to 7000 bits: far more than would ever be needed. We chose to pack 32 cells to a chip and to build a board that held 16 chips, to make a processor that could do 512bit arithmetic: this was adequate for research purposes and is still a commonly used size today. The FAP4 boardThe FAP4 board was designed to fit into a PC. It contained sixteen FAP4 chips, a few interface chips, and a 512bit RAM chip to hold "A" in the calculation "A×B (mod R)". This was awkward but it had to be done to save space on the chips themselves. Doing the designThe conventional way of designing gate arrays is to use a suitable design program draw the logic of the chip on your screen, probably copying circuit diagrams that you already had (most gate array designs started as replacements for existing logic circuitry). The program can then test the design for you by simulating the behaviour of the chip and it can write out the logic specification to a file that you send to the manufacturer. This was not at all what we wanted. We had no existing drawings to work from and the cell structure was so simple, and the pattern so repetitive, that a program could generate it. So we negotiated with Fujitsu and got them to agree that we could submit the design to them as a userunfriendly list: something like "cell 149 is of type N2N, it takes its input from cells 148 and 8860 and sends its output to cell 169" repeated over and over. I came up with the idea of writing the chip itself as a program. Each logic element was one line of that program (written in IBM Macro Assembler for the IBM PC). Depending on how you told the Assembler to handle these lines, you ended up with a program that worked just the way the chip worked (so that testing it was the same as testing the chip) or a program that you could run to output the design in the format that Fujitsu wanted. This design technique was cheap and fast, it was a great success and I can heartily recommend it. What happened nextThe rest is undramatic. We built the board and it worked (for the technically minded, we ran it at 5MHz which meant that it did a single exponentiation in 100ms or so). We sold quite a few of them, but the next stage (the full custom chip) never happened. Partly this was because we had got very busy with other things and partly it was because the market for fast RSA was smaller than had been thought; but there was also the problem that ordinary microprocessors were getting faster and faster. By now an ordinary PC can do publickey encryption faster than the specialised FAP4 board could do it 20 years ago. But in servers, or dedicated cryptographic accelerators, FAP4 is still a viable design, and more economical than some of the alternatives. 
