Crossing Over

Recently I’ve been doing a lot of work at a very low level. Not “should I use a primitive type or a class” low, but more like “how many cycles will this take?” low.

Today I was chatting with some folks on IRC, and the subject of binary searching came up. Now, I don’t know if this is going to surprise you, but in the last few years there has been some “movement” in the performance arena as far as things we just “know” go. It turns out, for example, that the “qsort” function isn’t the fastest horse in the race any more. And it turns out that bsearch is getting a little long in the tooth, too.

“Mon dieu! How can this be?,” you ask. Well, it’s templates.

The C++ guys apparently listened to Philip Greenspun, and decided that the way to better performance was to build a lisp interpreter into the compiler. That lisp interpreter is called “template metaprogramming” (TMP), and languages like C++ that support TMP can use it in some surprising ways.

The key thing to realize here is that there’s a trade-off involved. The old C standard library functions are – no question – about as fast as they can be. But the TMP guys came to the realization that the library functions were just that: functions. And so calling qsort or bsearch means passing in a callback function that takes a couple of pointers and compares whatever they point to.

This is very generic, but it comes at a fairly high cost in terms of run-time. Calling a function to make a comparison is a huge performance loss, especially on a fast CPU with a good-sized cache. All that stuff you’ve heard about cache misses and branch prediction and what-not? Well, it becomes relevant if you have to call through a pointer to the compare function, and it gets doubly relevant if your compare function turns around and calls some other function – like strcmp.
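To make that cost concrete, here’s the sort of comparator the C library wants. (This is presumably what the compare_ints used in the benchmark code later in this post looks like; the exact version isn’t shown, so treat this as a sketch in D’s extern(C) style.)

extern( C ) int compare_ints( const void* a, const void* b )
{
	// qsort and bsearch call this through a pointer for every
	// single comparison: two dereferences, a compare, a return.
	int x = *cast( const(int)* ) a;
	int y = *cast( const(int)* ) b;
	return ( x > y ) - ( x < y );
}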

So the template version of all this uses the same logic as the hoary old C standard library chestnuts, but it uses templates to build, at compile time, a special-purpose function for doing just exactly the Quicksort or binary search that you need to do right now.

The obvious down side? Each place where you invoke qsort or bsearch generates a different function. Big code bloat here, folks.

The up side? Breaking through the “call a function to compare two items” barrier. For small types – like int, float, etc. – and for “fully equipped” types like strings, there will be enough code lying around that the compiler can in-line some, if not most or all, of the comparison logic. Comparing strings will always require looping through arrays. But comparing two structs? That’s easy to inline. Comparing two ints? Please!
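Here’s a tiny taste of the mechanism (a minimal sketch using Phobos): std.functional.binaryFun turns the string "a < b" into a real function at compile time, and calling it is a direct, inlinable compare:

import std.functional : binaryFun;

alias less = binaryFun!( "a < b" );

static assert( less( 1, 2 ) );		// evaluated entirely at compile time
static assert( ! less( 2.0, 1.0 ) );	// instantiates for any type with a "<"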

Crossing Over

So with this in mind, I set out to find the “cross over” point. Put simply, where do the performance curves of linear search (lsearch) and binary search (bsearch) cross? That is, for N = how many items does it make more sense to use a binary search than to iterate?

Now, here’s what we know about bsearch: it has performance characteristic O( log n ). And we know that lsearch has performance characteristic O( n ). But what does that mean, exactly?

First, every routine has some overhead. When a subroutine is called, you push some information on the stack, initialize some variables, and then do your work. And later, you take stuff off the stack, free your resources, and return. That’s overhead.

Next, in the process of doing the actual work, you have to perform interim computations. For example, linear search basically requires some mechanism for iterating through the array you are searching. And that’s going to be a pointer or an integer or something. Updating that iterator is part of the structural cost of the algorithm. A bsearch algorithm typically involves computing (low + (high-low)/2) – written that way, rather than (low+high)/2, so the sum can’t overflow the index type. That’s the middle point for the search, and that’s a part of the structural cost.

Those structural costs can be high, sometimes. So saying that an O( log n ) function is faster than an O( n ) function automatically implies a caveat: it’s only true once n is big enough that the structural and overhead costs get washed out. And that’s why I was curious about the cross-over number. Because with templates, the call-a-function-through-a-pointer structural cost is gone. And so my understanding of where the lines cross is no longer valid.
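To put that a little more formally (my framing of it, anyway): the cross-over is the smallest n where

	overhead_b + cost_b * log2( n ) < overhead_l + cost_l * n

with the cost terms being the per-step structural-plus-comparison costs of each search. Shrink cost_b by inlining the comparison and the cross-over moves in toward smaller n; fatten it with indirect calls and it moves out.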

Great! Another ‘D’ in programming…

The language in question isn’t C++, though. It’s D. Now, D is a lousy name for a programming language, just like C is. Because Google is not your friend if you’re searching for “D”. Nor is “file.d” going to be a win, either, because that’s the default name for dependency files in a lot of build systems. So D wasn’t the best possible choice of names, and it can be hard to find info about it. To save you some grief, have a look at http://www.digitalmars.com/d – that’s the “official” D site, if there is such a thing. (For myself, I’ve taken to using “PL/D” as an abbreviation.)

PL/D version 2.0 supports templates. So here’s bsearch, as a “can find” style function – it just answers “is the key present?” rather than returning a position:

import std.functional : binaryFun;	// turns "a < b" into a callable at compile time

bool bsearch( T, alias cmp = "a < b" )( T[] array, T key )
{
	if( array.length == 0 )
		return false;

	size_t left = 0;
	size_t right = array.length;	// length is a size_t, so the indexes are too

	while( left + 1 < right ) {

		size_t middle = left + ( right - left ) / 2;	// overflow-safe midpoint
		T am = array[ middle ];

		if( binaryFun!( cmp )( key, am ) ) {

			right = middle;
		}
		else {

			left = middle;
		}
	}

	// return array[ left ] == key, using only "<": neither is less than the other
	return ! binaryFun!( cmp )( key, array[ left ] )
		&& ! binaryFun!( cmp )( array[ left ], key );
}
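A quick example of how it’s used (my own sanity check – note that the array has to be sorted already):

	int[] sorted = [ 1, 3, 5, 8, 13 ];
	assert(   bsearch( sorted, 8 ) );	// present
	assert( ! bsearch( sorted, 7 ) );	// absent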

Now I know that this version doesn't do early exit. But it's a little more readable to folks who might be new to the whole template thing, especially with PL/D. One thing I like more about D than C++ is that templates are instantiated like foo!( T )( a, b ) instead of foo< T >( a, b ). If nothing else, it's more HTML-friendly.

I won't explain all the neat-o features of the language here - see the link above for that. I will point out that this version generates a binary function using "a < b" by default, and generates a direct call to that function in the comparisons. Because those calls are direct (instead of indirect, through a pointer), the compiler can inline them, resulting in the comparison being made right in the function itself.

The testbed was an array of randomly-generated integers (sorted, of course – a binary search only works on sorted input). The driver code looked like this:

	// ========================
	// Template-based bsearch
	Thread.sleep( 1 );	// 1 tick = 100 ns: give control back to the OS
	timer.start();

	foreach (iter; 0 .. ITERATIONS )
		bsearch(a[], cast(int)(rand() % ELEMENTS));

	timer.stop();
	writefln( "Bsearch: %f usec (avg)", cast( float ) timer.microseconds / ITERATIONS );
	// "overhead" is the empty-loop cost, measured separately (not shown)
	writefln( " .... Cycle: %f usec (avg)", ( timer.microseconds - overhead ) / ITERATIONS );

	// ========================
	// C library bsearch (aliased as csearch), called through a function pointer
	Thread.sleep( 1 );
	timer.start();

	foreach (iter; 0 .. ITERATIONS ) {
		int key = rand() % ELEMENTS;
		csearch( &key, &a[ 0 ], ELEMENTS, int.sizeof, &compare_ints );
	}

	timer.stop();
	writefln( "Csearch: %f usec (avg)", cast( float ) timer.microseconds / ITERATIONS );
	writefln( " .... Cycle: %f usec (avg)", ( timer.microseconds - overhead ) / ITERATIONS );

Most of this should be obvious, but I’ll point out that the Thread.sleep calls were for 100 nanoseconds each – just enough to give control back to Windows so it hopefully wouldn’t interrupt the loops. The “csearch” call in the second block is actually a call to the bsearch function in the C standard library – I created an alias for it since I already had a bsearch function.

Overall, for ELEMENTS = 1000, the cycle time for the template-based bsearch was about 0.12 usec. The cycle time for the call-a-function C library version was about 0.165 usec. So for trivial comparisons, the template-based version gives a 25+% speed benefit – that’s why I was interested in running the tests!

Obviously, this number depends on the number of comparisons. For ELEMENTS = 100, you’re looking at log2 n ≈ 7 comparisons instead of 10 – csearch should catch up a little bit, since the fixed overhead gets a bigger share of the total. For ELEMENTS = 100,000, it would be 17 instead of 10, and presumably the template version would show even better performance.

The Edge of Seventeen

But that’s not why I started this. I wanted to know how well the template bsearch code would do against lsearch – linearly scanning the array. It seemed like a safe bet that 1,000 elements would be a win for the binary search…and it was. (whew!) But would the same be true for 100 elements? 10?

Well, for me, the number is 17. If ELEMENTS = 17, then lsearching the array (using a template-based lsearch) is faster than bsearching the array (using a template-based bsearch). (On the other hand, template-based lsearch crosses over C-style bsearch at 61 items!)
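For the curious, the template-based lsearch is just a loop built with the same binaryFun trick – a minimal sketch, not the exact code from my test:

import std.functional : binaryFun;

bool lsearch( T, alias cmp = "a == b" )( T[] array, T key )
{
	// The equality test is generated and inlined, so the loop
	// body is just a compare and a branch.
	foreach( item; array )
		if( binaryFun!( cmp )( key, item ) )
			return true;
	return false;
}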

Those performance numbers depend on a lot of things. Most importantly, they depend on the relative speed of your comparison code versus the structural code of the search function.

What does that mean? Well, take a look at a program called git-bisect. This program is a shell script that uses a binary search over your revision history to find the change that introduced a failure. Bisect knows the list of changes that were made to your source code, so it can pull out a version, build it, test it, and decide whether the problem is in that version or elsewhere.

The issue, though, is that the “comparison function” involves pulling out your 30 million lines of code, running a build, and then running the test that identifies whether the bug is in this version or not. The build and test process takes so much longer than the “add two numbers and divide by two” of the binary search that it’s a no-brainer – you’re almost always going to use the bsearch version.

But when the compare is a single CPU instruction, it’s a different story. In that case, the overhead of the bsearch pushes the crossover point far out past where theory says it should be. If your compiler is smart enough, and you’ve got enough memory to let it inline these kinds of functions… well, you have to draw the line somewhere, and measuring is the only way to know where.
