GPUs vs Cell

Blogged under Cell by Barry Minor on Wednesday 30 November 2005 at 7:39 pm

Recently I came across a link on www.gpgpu.org that I found interesting. It described a method of ray-tracing quaternion Julia fractals using the floating-point power of graphics processing units (GPUs). The author of the GPU code, Keenan Crane, stated that “This kind of algorithm is pretty much ideal for the GPU - extremely high arithmetic intensity and almost zero bandwidth usage”. I thought it would be interesting to port this Nvidia Cg code to the Cell processor, using the public SDK, and see how it performs given that it was ideal for a GPU.

First we directly translated the Cg code line for line to C + SPE intrinsics. All the Cg code structures and data types were maintained. Then we wrote a Cg framework for Cell to execute this shader, which included a backend image compression and network delivery layer for the finished images. To our surprise (well, not really), we found that a 3.2 GHz Cell chip using only 7 SPEs for rendering could outrun an Nvidia 7800 GT OC card at this task by about 30%. We reserved one SPE for the image compression and delivery task.

Furthermore, the way Cg structures its SIMD computation is inefficient, as it causes large percentages of the code to execute in scalar mode. This is due to the way it structures its vector data: array of structures (AOS) rather than structure of arrays (SOA). Converting this Cg shader from AOS to SOA form gave much higher SIMD utilization, which resulted in Cell outperforming the Nvidia 7800 by a factor of 5 - 6x using only 7 SPEs for rendering. Given that the Nvidia 7800 GT is listed as having 313 GFLOPs of computational power and seven 3.2 GHz SPEs have only 179.2 GFLOPs, this seems impossible, but then again maybe we should start reading more white papers and less marketing hype.
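The AOS-versus-SOA point can be illustrated with a small sketch. This is not the Cg shader or the SPE port. It is a plain-Python model in which lists stand in for 4-wide SIMD registers, and the quaternion values and function names are made up for illustration. It computes a squared quaternion norm, the kind of operation a Julia-set iteration tests against an escape radius, in both layouts.

```python
# Illustrative model only: Python lists stand in for 4-wide SIMD registers.
# The quaternion values and all function names are hypothetical.

QUATS_AOS = [
    (1.0, 2.0, 3.0, 4.0),
    (0.5, 0.5, 0.5, 0.5),
    (0.0, 1.0, 0.0, 1.0),
    (2.0, 0.0, 0.0, 0.0),
]


def norm2_aos(q):
    """AOS: one register holds x, y, z, w of a single quaternion."""
    squares = [c * c for c in q]  # one 4-wide multiply, all lanes useful
    total = squares[0]
    for s in squares[1:]:  # horizontal sum: lanes combined one by one
        total += s
    return total  # a single scalar result per register


def aos_to_soa(quats):
    """Transpose four (x, y, z, w) tuples into four component registers."""
    xs, ys, zs, ws = (list(lane) for lane in zip(*quats))
    return xs, ys, zs, ws


def norm2_soa(xs, ys, zs, ws):
    """SOA: each register holds one component of four different quaternions.

    Each multiply and add here maps to a full 4-wide operation and
    yields four results at once, with no horizontal step.
    """
    return [x * x + y * y + z * z + w * w for x, y, z, w in zip(xs, ys, zs, ws)]


aos_results = [norm2_aos(q) for q in QUATS_AOS]
soa_results = norm2_soa(*aos_to_soa(QUATS_AOS))
print(aos_results)
print(soa_results)
```

Both layouts give the same numbers. The difference is that the AOS version needs a lane-by-lane reduction for every quaternion, while the SOA version keeps all four lanes busy on four rays at a time.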

27 Comments

  1. Comment by Juice — December 1, 2005 @ 9:46 am

    What Cell hardware did you use?

  2. Comment by Hank — December 1, 2005 @ 10:26 am

    That is not considering the unused GPU power dedicated to texture processing and other parts unused in this particular example. Also I’d be curious to know what the specs of the Cell machine are (is it the IBM blade, the Mercury blade, the PS3 development platform?).

    It is not enough to blindly throw numbers like this.

  3. Comment by Luq Harith — December 1, 2005 @ 3:10 pm

    I’ve been hearing about the PS3 Cell chip being able to output 1080p graphics with HDR just by itself. Also, the E3 2005 PS3 Getaway demo is said to be rendered entirely by Cell without the help of a GPU. Can you clarify any of this news?

  4. Comment by Marco — December 2, 2005 @ 8:10 am

    I believe the GPU implementation has low performance due to dynamic branching.
    There is probably very poor coherence between pixels being shaded in the same batch (as you probably know, the G70 fragment shading engine is SIMD, not MIMD).
    While working on the Cell implementation, have you somehow addressed branching penalties (removing branches, adding branch hints, etc.)?

    thank you

  5. Comment by Barry Minor — December 2, 2005 @ 12:21 pm

    Juice,

    I used a UP Cell bringup system.
    3.2 GHz DD3.1 Cell processor with 8 good SPEs, 512 MB XDR Memory, 100 Mb network.
    Pixels were 128 bit, 32 bit float per color channel.
    All rendering parameters were the defaults set by the Cg program with the exception of the window size which was increased to 1024×1024.

  6. Comment by Barry Minor — December 2, 2005 @ 12:45 pm

    Hank,

    I have never seen a paper describing the distribution of floating point power in modern GPUs, or an explanation of the GPU’s microarchitecture. Maybe you can give me a pointer to such a paper. Given that the program runs only on the pixel shaders, the vertex shaders are bypassed, which removes 34.4 GFLOPs and leaves 278.6 GFLOPs for this program. How Cg maps this program to the GPU pixel shader HW is unknown to me.

    Cell system config is listed above.
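The peak-rate figures in the post and in this reply can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the thread. The per-SPE breakdown (4 single-precision lanes, 2 flops per fused multiply-add, one issued per cycle) is an assumption about how the 179.2 GFLOPs figure is derived, not something stated in the post.

```python
# Peak-rate arithmetic from the figures quoted in the post and comment 6.
# ASSUMPTION: each SPE issues one 4-wide single-precision fused
# multiply-add per cycle, counted as 2 flops per lane.

SPE_COUNT = 7
CLOCK_GHZ = 3.2
SIMD_LANES = 4
FLOPS_PER_FMA = 2

cell_gflops = SPE_COUNT * CLOCK_GHZ * SIMD_LANES * FLOPS_PER_FMA

GPU_TOTAL_GFLOPS = 313.0   # listed figure for the 7800 GT, per the post
GPU_VERTEX_GFLOPS = 34.4   # vertex shader share, per comment 6
gpu_pixel_gflops = GPU_TOTAL_GFLOPS - GPU_VERTEX_GFLOPS

# How much larger the GPU's listed pixel-shader peak is than the Cell peak.
peak_ratio = gpu_pixel_gflops / cell_gflops

print(f"Cell, 7 SPEs:      {cell_gflops:.1f} GFLOPs")
print(f"GPU pixel shaders: {gpu_pixel_gflops:.1f} GFLOPs")
print(f"GPU / Cell peak:   {peak_ratio:.2f}x")
```

On paper the GPU keeps roughly a 1.5x advantage even after the vertex shaders are excluded, which is what makes the measured result in the post notable.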

  7. Comment by Barry Minor — December 2, 2005 @ 1:51 pm

    Luq,

    We have rendered real-time 1080p images with our Terrain Rendering Engine (TRE) using only Cell. All of our framebuffers are 128 bits per pixel. I don’t know anything about the “Getaway” demo so I couldn’t comment on how it was implemented.

  8. Comment by Barry Minor — December 2, 2005 @ 2:08 pm

    Marco,

    I have heard that the dynamic branching in the current Nvidia GPUs may be causing performance degradation in the ray-tracer and that the ATI X1800 may address some of these issues. If anyone has access to an X1800 I would be interested in hearing its 1024×1024 frame rate.

    No, I didn’t modify the code structure (removing branches, unrolling loops, etc.) when porting it to Cell. Yes, this could be done, but I wanted to preserve the code structure so it would be a fair comparison and a simple conversion that any tool chain could achieve. Branch hints were added by the compiler and I didn’t add any __builtin_expect calls to the code.

  9. Comment by Edison — December 2, 2005 @ 9:34 pm

    Hmm, what is the difference between the DD2 Cell and the DD3.1 Cell?

  10. Comment by version — December 3, 2005 @ 2:28 am

    3.1 Cell processor, what is it?

  11. Comment by Marco — December 3, 2005 @ 3:54 am

    G70, AFAIK, processes pixels in batches of 256 quads (2×2 pixels each) in lockstep. If even one pixel in a batch takes a different path, that extra path is evaluated on the other 1023 pixels too, and the result is simply discarded. All the pixels in a batch take as much time to be shaded as the ’slowest’ pixel.
    ATI X1800 processes pixels in batches of 4 quads, so it should theoretically be more efficient than G70.
    Unfortunately I don’t know if dynamic branching is really killing G70 performance here.

  12. Comment by Padraig — December 3, 2005 @ 7:17 am

    Very interesting stuff indeed. Good job! I wonder what the speedup would be like if branch-avoidance and loop-unrolling were used (or in other words, if the app was built from the ground up for Cell). Quite amazing that a simple port would result in such improvements!

    On questions about “The Getaway” and the like, more generally I’d be curious as to Cell’s performance with more GPU-like polys-and-textures style rasterisation. The TnL figures released on ibm.com give us a little indication, but I wonder how good Cell is at rasterisation. I expect a GPU would outperform it here quite a lot, but it’d be interesting just out of curiosity. For example in PS3, I could see room for Cell to do some rasterisation in parallel with the GPU, before blending their frames together.

  13. Comment by Donovan — December 3, 2005 @ 12:23 pm

    DD3.1, eh, Barry? Was that a slip or are the revisions actually up to 3.1?

    If possible can you shed some light on differences between it and DD2? I have a feeling it’s hush-hush at this point, but asking never hurts?

    Interesting results, nonetheless.

  14. Comment by Laurent Lessieux — December 3, 2005 @ 3:31 pm

    That seems quite promising. It would be very interesting to see the code if possible :)

    I hope to port some of our code to the Cell soon, and if I am able to reproduce this kind of performance the Cell will become a great platform for us too.

  15. Comment by Tim Chambers — December 7, 2005 @ 7:31 pm

    Don’t forget that the 7800GT OC is 20 pipes x 425MHz = 8500 MTexels/s; the 7800GTX 512MB is 24 pipes x 550MHz = 13200 MTexels/s. So you could scale the GPU score by about 1.55x in the ideal case. Maybe the Cg shader code for the GPU could be better optimised? There are definite limitations in the GPU shaders and better flexibility in the Cell’s SPEs.

    http://www.xbitlabs.com/articles/video/display/g70-indepth_3.html
    “Each of the two shader units now has an additional mini-ALU (these mini-ALUs first appeared back in the NV35, but the NV40 didn’t have them). It improves the mathematical performance of the processor and, accordingly, the speed of pixel shaders. Each pixel processor can execute 8 instructions of the MADD (multiply/add) type in a single cycle, and the total performance of 24 such processors with instructions of that type is a whopping 165Gflops which is three times the performance of the GeForce 6800 Ultra (54Gflops). Loops and branching available in version 3.0 pixel shaders are fully supported.”

    http://www.rojakpot.com/default.aspx?location=3&var1=88&var2=0

    Need another look at the documentation of the architectures to work out what’s going on.
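The fill-rate scaling in this comment is simple enough to verify directly. The sketch below just multiplies the pipe counts and clocks quoted above. The function name is made up for illustration, and texel rate is only a rough proxy for shader throughput, so the 1.55x factor is an upper bound under ideal scaling rather than a prediction.

```python
# Texel fill-rate arithmetic from the pipe counts and clocks quoted
# in comment 15. The function name is hypothetical.


def texel_rate_mtexels(pipes, core_mhz):
    """Peak texel rate in MTexels/s: one texel per pipe per clock."""
    return pipes * core_mhz


gt_oc = texel_rate_mtexels(20, 425)     # 7800 GT OC
gtx_512 = texel_rate_mtexels(24, 550)   # 7800 GTX 512MB
scale = gtx_512 / gt_oc                 # ideal speedup of GTX over GT OC

print(gt_oc, gtx_512, round(scale, 2))
```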

  16. Comment by Barry Minor — December 8, 2005 @ 4:48 pm

    Edison, Version, Donovan,

    Chip revisions are a standard part of developing processors. The public will only see the final revision. Most of the changes between DD2 and DD3 were to improve yield. Some prototype systems floating around outside our lab still have DD2 parts. I have not seen any performance differences between the two parts with my code.

  17. Comment by Barry Minor — December 8, 2005 @ 4:58 pm

    Marco,

    My fast SOA code for Cell has the same problem, but to a lesser degree. I process four rays at a time in lockstep, so all four rays pay the price of the longest of the four paths. However, this is still much better than what you’re describing within the GPU.
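A toy cost model shows why the lockstep width matters so much. In the sketch below every batch costs as many steps as its slowest ray, multiplied by the batch width. The workload (1024 rays, one slow ray in every 64) is hypothetical, rays are batched in simple linear order rather than in 2D tiles, and the function names are made up, so the numbers illustrate the trend only and are not measurements of either chip.

```python
# Toy model of lockstep execution: every ray in a batch pays for the
# longest path in that batch. Workload and names are hypothetical.


def lockstep_cost(path_lengths, width):
    """Total lane-steps when rays are processed `width` at a time."""
    total = 0
    for start in range(0, len(path_lengths), width):
        batch = path_lengths[start:start + width]
        total += max(batch) * len(batch)
    return total


def utilization(path_lengths, width):
    """Fraction of lane-steps that do useful work."""
    return sum(path_lengths) / lockstep_cost(path_lengths, width)


# 1024 rays: most finish in 1 step, every 64th needs 50 steps.
rays = [50 if i % 64 == 0 else 1 for i in range(1024)]

for width in (1, 4, 1024):
    print(width, lockstep_cost(rays, width), round(utilization(rays, width), 3))
```

With four rays per batch only the batches containing a slow ray are penalized. With a 1024-pixel batch, the width described in comment 11, a single slow ray stalls everything.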

  18. Comment by Barry Minor — December 8, 2005 @ 5:27 pm

    Padraig,

    Thanks, I’m glad you found the data interesting as well.

    There are probably 10-15% more tuning type improvements that we passed over so as to preserve the code structure.

    The more interesting improvements could come from using a general purpose architecture to optimize the algorithm in a smarter way. For example, by approaching the problem differently, with less brute force, we can take advantage of the fact that rays missing the fractal are faster to compute than rays that hit the fractal. We can also adaptively super-sample the object, thereby improving the final image quality, without over-sampling the areas that don’t need it (background and areas of low contrast). As we are seeing from the responses, brute force processors aren’t agile enough to take advantage of such adaptive optimizations.

    As for Cell rasterization, we too are interested in knowing where Cell falls relative to a specialized processor. More on this later….
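The adaptive super-sampling idea mentioned in this reply can be sketched as a recursive subdivision. This is a common textbook form (sample the corners of a region and subdivide only where the contrast is high), not necessarily the scheme the author has in mind. The shader functions, the threshold, and all names below are hypothetical.

```python
# Minimal sketch of adaptive super-sampling by recursive subdivision.
# ASSUMPTION: `shade(x, y)` returns a single intensity; real code would
# shade rays and compare colours. All names here are hypothetical.


def adaptive_sample(shade, x, y, size, threshold, depth):
    """Average intensity over the square with corner (x, y) and side `size`."""
    corners = [shade(x, y), shade(x + size, y),
               shade(x, y + size), shade(x + size, y + size)]
    # Low contrast (or recursion limit reached): four samples are enough.
    if depth == 0 or max(corners) - min(corners) <= threshold:
        return sum(corners) / 4.0
    # High contrast: split into four sub-squares and refine each one.
    half = size / 2.0
    parts = [adaptive_sample(shade, x + dx, y + dy, half, threshold, depth - 1)
             for dx in (0.0, half) for dy in (0.0, half)]
    return sum(parts) / 4.0


class CountingShader:
    """Wraps a shade function and counts how often it is called."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, x, y):
        self.calls += 1
        return self.fn(x, y)


flat = CountingShader(lambda x, y: 0.25)                     # background
edge = CountingShader(lambda x, y: 1.0 if x > 0.5 else 0.0)  # hard edge

flat_value = adaptive_sample(flat, 0.0, 0.0, 1.0, 0.1, 1)
edge_value = adaptive_sample(edge, 0.0, 0.0, 1.0, 0.1, 1)
print(flat_value, flat.calls)
print(edge_value, edge.calls)
```

The flat region stops after four samples while the region containing an edge is refined, so the extra work goes only where the image needs it. This sketch re-shades shared corners for simplicity; a real implementation would cache them.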

  19. Comment by Barry Minor — December 8, 2005 @ 5:32 pm

    Laurent,

    I actually wrote the code for exactly that purpose. We just need to make sure we aren’t stepping on anyone’s copyrights. I hope to make both the AOS and SOA Cell versions available soon.

  20. Comment by Barry Minor — December 8, 2005 @ 6:55 pm

    Tim,

    Thanks for the info. When I went shopping for a card the only 7800s I saw were the GT. None of the retailers seem to stock the GTX cards (too new? too expensive?).

  21. Comment by Barry Minor — December 8, 2005 @ 7:25 pm

    Tim,

    By the way, the frame rate I saw on the 7800 GT OC was 3-4 fps with a window size of 1024×1024. If you have access to a GTX card, please post the 1024×1024 frame rate.
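Combining this frame rate with the speedups quoted in the post gives a rough idea of the implied Cell frame rates. The sketch below is arithmetic on the thread's own numbers, not a measurement. It assumes the 30% and 5 - 6x figures apply at the same 1024×1024 window size, and it uses the 128-bit pixel format from comment 5 to size one frame.

```python
# Implied (not measured) Cell frame rates, derived from numbers in the
# post and comments. ASSUMPTION: the quoted speedups apply at 1024x1024.

GPU_FPS = (3.0, 4.0)          # 7800 GT OC at 1024x1024, per comment 21
AOS_SPEEDUP = 1.3             # "about 30%" faster, per the post
SOA_SPEEDUP = (5.0, 6.0)      # "5 - 6x", per the post

aos_fps = (GPU_FPS[0] * AOS_SPEEDUP, GPU_FPS[1] * AOS_SPEEDUP)
soa_fps = (GPU_FPS[0] * SOA_SPEEDUP[0], GPU_FPS[1] * SOA_SPEEDUP[1])

WIDTH = HEIGHT = 1024
BYTES_PER_PIXEL = 128 // 8    # 4 channels x 32-bit float, per comment 5
frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL

print(f"AOS port: {aos_fps[0]:.1f} to {aos_fps[1]:.1f} fps")
print(f"SOA port: {soa_fps[0]:.0f} to {soa_fps[1]:.0f} fps")
print(f"One frame: {frame_bytes // (1024 * 1024)} MiB")
```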

  22. Comment by Marco — December 10, 2005 @ 4:50 pm

    Interesting. Processing a quad in lockstep is clever, and it’s probably a small cost one has to pay in order to be way more efficient in the vast majority of cases (we don’t like images made out of white noise…).
    It’s nice to hear that your team is writing a sw rasterizer. I believe an SPE can be a good rasterizer if it can work on a local problem (tiled rendering?) and on untextured primitives.

  23. Comment by Padraig — December 22, 2005 @ 7:57 am

    Thanks for your comments, Barry. I look forward to any findings re. rasterisation on Cell!

    I was looking at the Cell performance benchmarks at http://www-128.ibm.com/developerworks/power/library/pa-cellperf/?ca=drs-, and wondered about the transform and lighting benchmark - do you know if the polys were being rasterised after transformation and lighting in that app?

    I’d be particularly interested in seeing how Cell performs with blending - multiplicative, additive etc. Cell’s role in rasterisation would likely be a complementary/helper one, in a system with a GPU present, so I’m wondering if something like transparency rendering and blending could be outsourced to the CPU to be done in parallel with the GPU’s rendering, and then blended with the final frame rendered by the GPU. I think it could be done, with a shared z-buffer, but I’m wondering if the performance would be there to make it really worthwhile. If it is, it could be a big win - it seems particularly attractive because you could be eating Cell’s generous internal bandwidth rather than precious external memory bandwidth as the GPU would do. I’m thinking about things like particle systems - it could be neat to generate, simulate and then render alpha-blended quads all on Cell, with relatively little memory bandwidth expenditure versus a traditional approach with GPU rendering.

  24. Comment by Erwin Coumans — July 17, 2006 @ 12:15 pm

    IBM released the sample source code of this Julia sample, including the efficient 4-way associative spu-cache. You can download the SDK for free here http://www.alphaworks.ibm.com/tech/cellsw

    If you don’t have time to install the full Linux toolchain, and just want to see the sample source, the fact that it is open source allows redistribution, so here you go: http://www.continuousphysics.com/ftp/pub/test/physics/source/ibm_cellsim_1.1_samples_src.zip

    Thanks IBM!
    Erwin

  25. Comment by dz — December 11, 2006 @ 10:53 am

    Could someone try it with a G80? Would be interesting to see if that one could outperform the Cell.

  26. Pingback by Using your ps3 for more useful things :) - Page 2 - Emuforums.com — November 5, 2007 @ 12:00 am

    […] and it beats the crap out of a G70 GPU in terms of raw processing power, as proven by someone. GameTomorrow » GPUs vs Cell Article! And anyway, it’s no coincidence that the Cell BE broke the processing record in […]

  27. Comment by Jebus — February 16, 2008 @ 9:10 pm

    You’re comparing a microprocessor to a GPU, why??

    Besides, the Cell processor is a technology that’s 10 years old; the fact that Sony decided to implement it in their console doesn’t mean it can stack up against today’s computer hardware. In fact IBM itself has stated that the Cell processor is only capable of reaching theoretical speeds, speeds which do not inherently reach the climax of the Cell claim.

    Do you really think IBM will sell you something for $400 - $600 that can outdo most available hardware??

    Anyone smell a rat??

The postings on this site solely reflect the personal views of the authors and do not necessarily represent the views, positions, strategies or opinions of IBM or IBM management.