Why not 2 stacks?

I was googling around the other day for compiler wisdom when I stumbled on a computer security “research” paper. Some one or ones were postulating some theoretical result about vulnerability to buffer overflow attacks in blah, blah, blah special case. I didn’t read it, because that’s not what I was after.

But it made me think: how come Microsoft hasn’t just solved this problem already? Two “stacks” would wipe out the overflow attack problem at minimal cost.

Back Story

If you don’t know, here’s a very (!) short summary of the problem.

Computer software uses “subroutines” to do everything. In fact, even the “main” program is a subroutine. Because subroutines can call other subroutines, they can “stack up”. That is, “main” can call “sell ticket” which can call “print schedule” which can call “draw table”, etc.

What’s a stack?

In modern systems, there is no really good way to keep track of how many routines will get stacked up, a general-purpose data structure called a “stack” is used. This is so important that CPU’s explicitly support a single “call stack” with special CPU instructions like push and pop, special registers to access the stack, and all kinds of incredibly expensive stuff.

A stack works by “stacking” data, in much the same way they stack plates at a buffet restaurant (or cafeteria). The only data item you can get to is the one at the “top” of the stack. (The only plate you can take is the one on the top of the stack. See?) If you want to put on more data, it goes on the top of the stack, and the stack grows. If you want to take off data, you take it off the top of the stack, and the stack shrinks.

In this case, though, there are two ways to think about what “data” means. One way is to think of data as being individual units. One number goes on the stack, then another number goes on top of it, etc. But another way is to think of data “sets” going on the stack. If you think about the back button on your web browser, it acts like a stack. You click on a link, and a page gets pushed on to the “history” in the browser. You click “Back” and a page gets popped off the stack, leaving a different page visible in the browser. But these pages aren’t just a single number. There’s a URL, the page contents, maybe some form data, etc.

In terms of computer subroutines, each subroutine may have some “local data” that it needs to store. The URL of a web page, or the name of the current user, for example. And this data is expected to be used within the subroutine, maybe passed down into any sub-subroutines that get called, and then it will disappear when the subroutine finishes.

So a subroutine may have a bunch of local data, much like a web page in your browers. That can also be managed as part of a stack, using what is called a stack frame. A stack frame is a collection of data that all gets jammed onto the stack at one time, so the subroutine can use it. Again,”modern CPUs” (pretty much all of them built since the 1970’s) support this directly.

How important is the stack?

The de facto standard computer in the world today is the Intel x86 CPU of some kind: Pentium, Opteron, whatever you want to call it. Back when dinosaurs roamed the earth and nobody thought a 16-bit computer was worth the extra money over a 8-bit computer, the x86 family had about 30,000 transistors on board. Now, when people are arguing the relative merits of 32 versus 64 bit computing, the transistor count is about a billion.

The x86 family of computers work using “registers.” A register is a place in the center of the CPU where a number can be held while other stuff gets done to it. If you want to add two numbers, you use a register. If you want to store the result, you probably use another register. They are absolutely crucial to the design of the x86, and most other widespread CPUs. (There are alternative architectures: stack based and accumulator based CPUs. But nobody buys them except for special applications.)

The x86 registers back in the time of the 8088 and 8086 computer were called:

  • AX The accumulator register, best for math.
  • BX The base address register, best for pointer references.
  • CX The counter register, best for loop counts.
  • DX The data register, used for multiply/divide operations, I/O operations, and indirect addressing.
  • SI The source index register, for memory-to-memory block operations, like copying data.
  • DI The destination index register, used for memory block operations, like SI.
  • BP The base pointer register, used to point to the base of the stack frame.
  • SP The stack pointer register, used to point to the current “top” of the stack.

There were some additional registers, like IP, flags, CS, DS, ES, and SS. But they weren’t (and still aren’t) general purpose registers, can’t participate in computations, and generally don’t get modified much by programs. Because the registers listed above are general purpose registers, they can be used for things not listed above. For example, if you need to store a temporary intermediate result while you are computing some other value, you can always jam it into the DI register. (or SI, or DX, or whatever you like). But not BP or SP.

So back in the day, when we had to carve our own computers out of rocks using nothing buy a bronze chisel and a wooden mallet, BP and SP were given over to the stack frame. That’s two registers out of eight — 25%, for you liberal arts majors — assigned to do the job of keeping track of the stack.

And now?

When the 80×86 family went 32-bit, the registers changed. The 32-bit registers were called “extended” registers, and so the names became: eax, ebx, ecx, edx, edi, esi, ebp, esp.

Now, with the advent of the 64-bit x86 family, the names have changed again: rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp. But in addition, AMD added 8 new registers to the x64 architecture, and Intel followed suit.

This means that if you’re using a 32-bit CPU, or a 64-bit CPU in compatibility mode, you’re using 8 registers, still, and 2 of them are dedicated to the stack, still. (25%, still, for you liberal arts majors.) If you’re using a 64-bit CPU in 64-bit mode, then you’ve got 16 registers (that’s 12.5% for all you Art History grads).

So the stack frame, and the stack pointer, and this whole stack “thang” are pretty important.

What’s a buffer overflow attack?

So there’s a stack, and your subroutines use it. The most obvious way they can use it is to store a “return address” on it. If you have a subroutine called sub_1, it might have the stack set up like this:

    :
  ????
  ????
  1234 < -- SP

Where the SP (stack pointer) points to 1234. What's 1234? Well, it's the return address -- an address is a location in computer memory -- where it will send program control when it finishes doing whatever it is doing.

And if the sub_1 routine calls a sub_2 routine, the stack will look like:

    :
  ????
  ????
  1234
  1299 < -- SP

When the CPU executes a call instruction, it pushes the return address (of the instruction right after the call) onto the stack automatically. (And yes, on Intel platforms the stack always goes down instead of up. Don't ask why, it doesn't matter.)

When the sub_2 routine returns control to sub_1, it pops the stack and uses the address that it gets to branch back.

    :
  ????
  ????
  1234 < -- SP
  1299

Of course the 1299 is still down there -- it doesn't get erased. Instead, the SP adds 4 (4 is the size of a return address on a 32-bit [4 byte] CPU) and we go on. Very efficient, very fast, and very automatic with all the special instruction support.

Stack Frames, redux

When the program has local data it needs to store, the situation gets a little more complex. The caller pushes a return address, then the callee (the local subroutine) moves the stack pointer a little bit more, to make room. If sub_2 needs 12 bytes of storage, the preamble of the subroutine looks like this:

   push(ebp)
   ebp = esp
   esp = esp - 12

A preamble is a standard set of code that gets executed at the top of every function. It is generated automatically by the compiler — programmers don’t have to think about inserting it.

When it comes time to clean up and leave the subroutine, the corresponding code is the postamble and it does something like:

   esp = esp + 12
   pop ebp
   return

(In fact, this is wrong. But the effect is the same and it’s easier for you to understand.)

Before the subroutine is called, the stack looks like:

    :
  ????
  ????
  1234   < -- SP

Then when the call happens, the stack looks like:

    :
  ????
  ????
  1234
  1299  < -- SP

Then when the preamble runs, we get:

    :
  ????
  ????
  1234
  1299
   BP'  < -- BP
   ?
   ?
   ?     <-- SP (12 bytes = 3 x 32bit words of "stack frame" storage)

Note that the new BP is pointing to the place on the stack where the previous BP was stored.

So?

So there's a stack frame, and it's a little complex, and it's sitting there on the stack. And in the middle of that stack is the "local data" the subroutine needs.

Sometimes the "local data" includes what is called a "buffer." A buffer is just a space to store a bunch of characters in a row. For example, if the program is going to ask me for my name and address, it needs a place to store however-many characters go into a name and address.

If the buffer is 12 bytes long, I could enter my name as "Austin" and there would be not problem. But if I entered "Austin Hastings" (14 letters, with the space) the buffer would not be long enough to store the name. The problem is that if I am allowed to just enter any old thing, I might deliberately do something destructive. Suppose we are using our previous example:

    :
  ????
  ????
  1234
  1299
   BP'  < -- BP
   ?
   ?
   ?     <-- SP (12 bytes = 3 x 32bit words of "stack frame" storage)

If I type my name as "Austin Hastings", the buffer overflows, like this:

    :
  ????
  ????
  1234
  1299
  ngs   < -- BP
  asti
  in H
  Aust  <-- SP

The old value of BP gets wiped out, and so the calling subroutine will have a bunch of bogus data (because it accesses all the local variables relative to where BP is pointing).

But if I type my name as "Austin Hastings.1215" look what happens to the return address that is supposed to point to the subroutine that called us:

    :
  ????
  ????
  1234
  1215    -- look here!
  ngs.   < -- BP
  asti
  in H
  Aust  <-- SP

See? The return address got changed by me typing in a special, too-long name. This is called a "buffer overflow attack", and if done right it can cause the subroutine to return control to a new program, written by the attacker.

Why can Microsoft solve the problem?

First, Microsoft writes Windows, the most attacked system in the world. And it writes the Microsoft C/C++ compiler, which is the tool of choice for compiling on Windows. Since anything Microsoft does will get copied by the Linux and Apple guys, this is a no-brainer. If Microsoft solves the problem, everyone will copy them and solve the problem for their own sites.

Okay, how can Microsoft solve the problem?

The solution to the "attack" part of the problem is to use two stacks. There are two registers already, and one of them rarely gets used (SP). So split them into two different stacks, and you eliminate the ability of an attacker to overwrite the return address (by separating the stack frame data from the "just-a-stack" data).

Details

Here's what the previous example would look like with two stacks.

  "Frames":          "Stack":
      :                1234
    ????               EBP(1)  < -- SP(2)
    ????  <-- BP(2)    1299
    ....               EBP(2)  <-- SP(3)
    ....
    ....  <-- BP(3)

Notice that I've drawn the buffer as dots, and whatever was in the caller's local data as question marks. Notice also that the important data -- the frame pointers and return addresses -- are in a totally different place from the stack frame data. This means that no buffer overflow attack can reach them.

That's it?

Yep. That's it.

Mind you, it's an expensive it. But not a horribly expensive one. Microsoft, and the GCC folks, and Intel, and anybody else that writes a compiler, would have to come up with a transition plan to take us from one system to another. It would probably easier for Microsoft, since they can deliver their system (Windows) all at once. This recompile would have to happen for almost every problem, so it might take a while for total coverage. But the "most attacked" programs are things like email servers, web servers, the PHP interpreter, etc. There is a pretty short list of those, and getting them secured might take an hour and a half, if everyone would work together.

Any problems with this idea?

First, it doesn't just magically make everything work. Some solutions, like W(+)E, try to do that. If you can just "solve it in hardware" then the problem goes away without messing around with that 23-year-old program that the boss uses but nobody has the source for.

Second, this approach doesn't require anybody to buy a new computer. (But keep reading!) So unless Microsoft forces an upgrade with Windows 7 or 8 or whatever (and when have they ever forced an upgrade?) there's no money in it for the CPU vendors.

Also, don't forget that this only protects against buffer overrun attacks that try to force code execution. It won't protect you from corruption of data that is in a stack frame buffer or local variable. Someone might try to rewrite their account balance, or change your password, or whatever.

Checksum the Stack Frame

There's a solution for that, too. And it could involve new hardware. Just create a "checksum" value for the stack frame, and put it right at the end of each subroutine's frame. That way, if an attacker overruns a buffer it will destroy the checksum.

Obviously I don't mean a real checksum. What I mean is something unpredictable, unlikely to occur at random, and easily checked. Most of the registers wouldn't work, since their values are generally predictable. It would make sense to create a dedicated register for this on the CPU -- which would also allow Intel to make some new hardware -- and push the "random" value on both the frame stack and the SP stack (for checking).

If you're so damn smart, why aren't you rich?

I used to do a lot of assembly programming. So I've always kind of wondered why this wasn't obvious to everybody else. And it may well be that there's some critical flaw in the logic. Probably having to do with interrupt-proofing the two-stack transition.

But maybe there's not. Maybe it's just obvious to me. In which case, I hereby demand that anyone who implements this must call it "Austin's Device". And if you implement the checksum idea, that shall be "Austin's Other Device."

But there's an old joke about an Economics professor and one of his grad students walking across the quad. And the grad student says, "Look, professor! There's a $100 bill lying on the ground."

To which the professor replies, "Don't be absurd. If there were really a $100 bill lying on the ground, someone would pick it up."

Maybe you know what's wrong with my clever device? If so, please let me know.

Leave a Reply