As Ars Technica announced three days ago, Intel’s 2009 launch of its ambitious Larrabee GPU has been canceled: “The project has suffered a final delay that proved fatal to its graphics ambitions, so Intel will put the hardware out as a development platform for graphics and high-performance computing. But Intel’s plans to make a GPU aren’t dead; they’ve just been reset, with more news to come next year.”

I can’t wait for more news about that radical new architecture from the only major graphics hardware vendor that has a long history of producing or commissioning open source drivers for its graphics chips.

But what are we excited about? In a nutshell, Larrabee is about automatic vectorization: any loop whose call graph is statically known and whose iterations have no data dependencies between them can be executed in parallel. That means that in many cases, developers can take their existing code and get easy parallel execution for free.
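To make those two conditions concrete, here is a minimal C++ sketch. It is an illustration only, not code from Intel or from Abrash’s article, and the function names `scale_add` and `prefix_sum` are hypothetical. The first loop has independent iterations, so it is the kind of loop a vectorizing compiler could run several iterations at a time; the second carries a dependency from one iteration to the next and does not qualify as written.

```cpp
#include <cstddef>
#include <vector>

// Independent iterations: out[i] depends only on a[i] and b[i], and the loop
// body calls nothing unknown. Assumes a and b are at least as long as out.
void scale_add(const std::vector<float>& a, const std::vector<float>& b,
               float k, std::vector<float>& out) {
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = k * a[i] + b[i];
    }
}

// Loop-carried dependency: iteration i reads v[i - 1], which iteration i - 1
// has just written. Running the iterations side by side would change the result.
void prefix_sum(std::vector<float>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        v[i] += v[i - 1];
    }
}
```

Whether a given compiler actually vectorizes the first loop depends on the compiler, its flags and the target hardware; the sketch only shows which shape of loop meets the criteria.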

Since I’m an utter layman in the field of processor architecture, I’ll let you read the words of Tim Sweeney of Epic Games, who provided a great deal of input into the design of LRBni. He sums up the big picture a little more eloquently than I can, and I found him cited in Michael Abrash’s April 2009 article in Dr. Dobb’s, “A First Look at the Larrabee New Instructions”:

Larrabee enables GPU-class performance on a fully general x86 CPU; most importantly, it does so in a way that is useful for a broad spectrum of applications and that is easy for developers to use. The key is that Larrabee instructions are “vector-complete.”

More precisely: Any loop written in a traditional programming language can be vectorized, to execute 16 iterations of the loop in parallel on Larrabee vector units, provided the loop body meets the following criteria:

  • Its call graph is statically known.
  • There are no data dependencies between iterations.

Shading languages like HLSL are constrained so developers can only write code meeting those criteria, guaranteeing a GPU can always shade multiple pixels in parallel. But vectorization is a much more general technology, applicable to any such loops written in any language.

This works on Larrabee because every traditional programming element — arithmetic, loops, function calls, memory reads, memory writes — has a corresponding translation to Larrabee vector instructions running it on 16 data elements simultaneously. You have: integer and floating point vector arithmetic; scatter/gather for vectorized memory operations; and comparison, masking, and merging instructions for conditionals.

This wasn’t the case with MMX, SSE and Altivec. They supported vector arithmetic, but could only read and write data from contiguous locations in memory, rather than random-access as Larrabee. So SSE was only useful for operations on data that was naturally vector-like: RGBA colors, XYZW coordinates in 3D graphics, and so on. The Larrabee instructions are suitable for vectorizing any code meeting the conditions above, even when the code was not written to operate on vector-like quantities. It can benefit every type of application!

A vital component of this is Intel’s vectorizing C++ compiler. Developers hate having to write assembly language code, and even dislike writing C++ code using SSE intrinsics, because the programming style is awkward and time-consuming. Few developers can dedicate resources to doing that, whereas Larrabee is easy; the vectorization process can be made automatic and compatible with existing code.
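Stepping outside the quote for a moment: the sketch below is one way to picture how a conditional and an indexed memory read map onto the compare, mask and gather operations Sweeney mentions. It is plain C++ that emulates 16 lanes with ordinary loops, not LRBni code or the output of Intel’s compiler, and every name in it (`kLanes`, `scalar_version`, `lanewise_version`) is made up for the illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 16;  // assumed vector width, per the quote

// The loop as a developer would normally write it.
// Assumes idx and out are as long as x, and every idx[i] is valid for table.
void scalar_version(const std::vector<float>& x, const std::vector<int>& idx,
                    const std::vector<float>& table, std::vector<float>& out) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (x[i] > 0.0f) {
            out[i] = table[idx[i]] * x[i];
        }
    }
}

// The same loop restructured into 16-wide steps, emulated lane by lane.
void lanewise_version(const std::vector<float>& x, const std::vector<int>& idx,
                      const std::vector<float>& table, std::vector<float>& out) {
    for (std::size_t base = 0; base < x.size(); base += kLanes) {
        const std::size_t width = std::min(kLanes, x.size() - base);

        // Compare: the "if" becomes one mask bit per lane.
        std::array<bool, kLanes> mask{};
        for (std::size_t lane = 0; lane < width; ++lane) {
            mask[lane] = x[base + lane] > 0.0f;
        }

        // Gather: each active lane loads from its own, non-contiguous address.
        std::array<float, kLanes> gathered{};
        for (std::size_t lane = 0; lane < width; ++lane) {
            if (mask[lane]) {
                gathered[lane] = table[idx[base + lane]];
            }
        }

        // Masked multiply and store: inactive lanes leave out[] untouched.
        for (std::size_t lane = 0; lane < width; ++lane) {
            if (mask[lane]) {
                out[base + lane] = gathered[lane] * x[base + lane];
            }
        }
    }
}
```

On real vector hardware each of the three inner loops would be a single instruction operating on all lanes at once; here they only show that the restructured loop produces the same result as the scalar one.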

With cores proliferating on more CPUs every day and an embarrassing number of applications not taking advantage of them, bringing easy parallel execution to the masses means a lot. That’s why I’m eager to see what Intel has in store for the future of Larrabee.