Cell Can’t Texture?

Blogged under Cell, Consoles by Barry Minor on Friday 24 March 2006 at 12:14 pm

Much has been said about Cell’s presumed inability to texture map well. Given the small (256 KB) local stores and DMA-based memory access, many relegated the SPEs to handling only nice streaming, geometry-type workloads. This seemed like an issue ripe for a little prototyping.

First, colleague Mark Nutter implemented a software cache abstraction layer for the SPE, giving us the ability to both hide the complexity of DMAs and benefit from transparent data reuse. Next, given the lessons learned from this paper, we tiled our textures, optimized our access patterns, and implemented several cache replacement policies. We then rewrote the shader in the Quaternion Julia Set Raytracer to add five cubemap texture lookup passes: three refraction lookups, a reflection lookup, and a background lookup. These five texture lookups were then blended together with a Fresnel calculation and modulated with the base lighting computation to form the final sample color.
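To make the texture tiling step concrete, here is a minimal sketch of how a texel coordinate might map to an offset in a tiled layout. It is not our SPE code: the function name `tiled_offset` and the 8×8 tile size are assumptions chosen purely for illustration. The idea is that texels that are neighbors in 2D end up close together in memory, so one fetched block serves several nearby lookups.

```python
def tiled_offset(x, y, width, tile=8):
    """Return the texel index of (x, y) in a tiled layout.

    Hypothetical helper: tiles are stored one after another in row-major
    order, and texels inside each tile are also row-major. The tile size
    of 8 is an assumption, not a figure from the post. `width` must be a
    multiple of `tile`.
    """
    tiles_per_row = width // tile
    tile_x, in_x = divmod(x, tile)   # which tile column, offset inside it
    tile_y, in_y = divmod(y, tile)   # which tile row, offset inside it
    tile_index = tile_y * tiles_per_row + tile_x
    return tile_index * tile * tile + in_y * tile + in_x


# Texels in the same 8x8 tile fall inside one 64-texel block.
print(tiled_offset(0, 1, 1024))   # one row down, still in tile 0
print(tiled_offset(8, 0, 1024))   # first texel of the next tile
```

With 16-bit texels, an 8×8 tile would occupy 128 bytes, so a whole tile could be moved in a single transfer. The right tile size in practice depends on the access pattern and on the cache line size chosen.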

The results were very pleasing.

Sample Frame

Quicktime H.264 movie (16MB so be patient)

We found that even with a small (8 KB) 4-way set-associative software cache, miss rates for this renderer were a low 7% and hit access times were only 12 SPE cycles.
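For readers unfamiliar with the term, the following is a rough model of a 4-way set-associative cache with LRU replacement. It is only a sketch for counting hits and misses, not the SPE implementation: the class name `SoftwareCache`, the 128-byte line size, and the choice of LRU are assumptions (we tried several replacement policies, and the real miss handler issues a DMA rather than updating a Python list).

```python
class SoftwareCache:
    """Illustrative 4-way set-associative cache model with LRU replacement."""

    def __init__(self, size_bytes=8 * 1024, line_bytes=128, ways=4):
        # line_bytes=128 is an assumed value, not one stated in the post.
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # Each set holds line numbers ordered from least to most recently used.
        self.sets = [[] for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Touch one byte address; return True on a hit, False on a miss."""
        line = addr // self.line_bytes
        tags = self.sets[line % self.num_sets]
        if line in tags:
            tags.remove(line)        # move to the most-recently-used slot
            tags.append(line)
            self.hits += 1
            return True
        # Miss: a real SPE cache would DMA the line from system memory here.
        if len(tags) == self.ways:
            tags.pop(0)              # evict the least recently used line
        tags.append(line)
        self.misses += 1
        return False

    def miss_rate(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


cache = SoftwareCache()
for addr in (0, 64, 2048, 64):       # 0 and 64 share a line; 2048 maps to the same set
    cache.access(addr)
print(cache.hits, cache.misses)
```

Feeding a model like this with a trace of texel addresses is one way to compare cache sizes and replacement policies before committing to an SPE implementation.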

Graph

Using only seven 3.2 GHz SPEs, we were able to raytrace 15 frames per second at a frame resolution of 1024×1024. The texture buffer held a cubemap whose faces were 1024×1024 texels at 16 bits per texel, resulting in a 12.5 MB texture buffer in XDR system memory. The performance penalty for using the five-pass texture shader versus the lighting-only shader was just 13%.
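As a quick sanity check on these numbers, the arithmetic can be worked through as below. The lookup rate is a back-of-the-envelope estimate only: it assumes one shaded sample per pixel and that every sample performs all five lookups, neither of which is stated above.

```python
FACES = 6                  # a cubemap has six faces
FACE_SIDE = 1024           # texels per face edge
BYTES_PER_TEXEL = 2        # 16-bit texels

texture_bytes = FACES * FACE_SIDE * FACE_SIDE * BYTES_PER_TEXEL
print(texture_bytes)                    # 12582912 bytes
print(texture_bytes / 2**20)            # 12.0 MiB

FRAME_SIDE = 1024
FPS = 15
LOOKUPS_PER_SAMPLE = 5     # 3 refraction + 1 reflection + 1 background
SPES = 7

# Assumption: one sample per pixel, all five lookups performed for each.
lookups_per_second = FRAME_SIDE * FRAME_SIDE * FPS * LOOKUPS_PER_SAMPLE
print(lookups_per_second)               # 78643200 lookups per second
print(lookups_per_second / SPES)        # estimated lookups per second per SPE
```

The footprint works out to about 12.6 million bytes (12 MiB), in line with the roughly 12.5 MB quoted, and it is far larger than a 256 KB local store, which is why a cache in front of system memory matters here.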

Our miss handler was implemented as a blocking function, and we still have ideas pending to further reduce the 12-cycle software cache hit access time, so we believe the 13% performance gap between the two shaders will continue to close.

5 Comments

  1. Comment by Peter Kennedy — March 26, 2006 @ 1:23 pm

    Since it’s not my thing, I have zero idea how well your results compare to what might have been expected, or to other CPUs and GPUs (a point of reference might have been useful for comparison for the ignorant amongst us! ;)), but the software cache performance you describe sounds very impressive. I think I read previously about a software cache implementation on SPUs with a hit latency in the 20-cycle range, so 12 cycles is very impressive indeed.

    Do you think the software cache you have implemented could be applied more generally to other tasks, for example, traversing a large tree, or is it very specifically tuned to texturing? I guess one of the advantages of doing it yourself is you can optimise it for your workload, but I wonder what cache performance for tree traversal (for example) could be like with similar tuning.

    Very encouraging work, by the way. Oh, and on Cell rendering, it might interest you to know that PS3 developers would appear to already be using Cell for rendering in tandem with the GPU - I saw at GDC, for example, that the makers of Warhawk are using a “software volumetric ray-tracer” on some SPUs to render the clouds in the game, mixing the software rendering on Cell with RSX’s hardware rendering. It’s exciting stuff!

  2. Comment by Barry Minor — March 27, 2006 @ 2:09 pm

    Thanks Peter,

    To get a feel for how fast this code runs on other systems download the GPU version of the program (without the texturing) from:

    http://graphics.cs.uiuc.edu/svn/kcrane/web/project_qjulia.html

    and resize the window to 1024×1024.

    Yes, the software cache can be applied to problems other than texturing. Our version was tuned to texturing in the sense that accesses were to tiled data structures and the cache line size was adjusted to match the tile size. In general, if you are using the SPEs to access large data structures with complex access patterns and you know there is data reuse, then the software cache is a good option to try.

    It’s good to hear that there are game companies already getting the SPEs in the rendering loop. I look forward to seeing their results.

  3. Comment by Barry Minor — March 27, 2006 @ 3:45 pm

    When comparing the performance of this texturing shader to the non-texturing GPU shader in the link above, several things need to be taken into account. While the GPU is very fast at texture lookup, it still needs to compute the 4 texture coordinates (3 refract + 1 reflect) and a Fresnel value. All of these add cycles to the pixel shaders and will therefore slow down the GPU’s overall performance at this task.

  4. Comment by Chris Thornborrow — April 2, 2007 @ 3:16 am

    That’s a very beautiful demo of your technique, lovely to see that movie.

    In your example application, you are using a single texture to do 5 texture lookups per pixel. Further, the accesses to this texture are likely to be close (i.e., into the same block of tiled texture) in four of the cases, I think (am I right here?). Next, the raytracer will be heavyweight for GPUs today, as this is definitely not their sweet spot in terms of performance, conflicting for resources with texture lookup calculations on the GPU.

    It seems to me therefore you have very cleverly chosen a specific example that would optimise results in your favour rather than present a general case from which to draw strong conclusions.

    How then would your results change when compared to a GPU if in fact you were using 5 different textures? Or if the GPU were doing standard rasterisation of triangles, rather than raytracing?

    Nonetheless, like I said, great demo, really nice movie.

  5. Comment by fais — December 17, 2007 @ 4:34 pm

    wow.. i was trying to figure out how to overcome the 256k LS limitation on the SPE to perform texture mapped triangle rasterization. I knew i would need to implement some sort of software cache scheme… good to know it is possible :D

The postings on this site solely reflect the personal views of the authors and do not necessarily represent the views, positions, strategies or opinions of IBM or IBM management.
