SoC drawer: The Cell Broadband Engine chip: High-speed offload for the masses

Source: IBM developerWorks Worldwide


Submitted by yangyi on 2007-05-14 21:57:04


Level: Introductory

Sam Siewert (Sam.Siewert@Colorado.edu), Adjunct Professor, University of Colorado

17 Apr 2007

Cell Broadband Engine™ (Cell/B.E.) chips are leading the broadband revolution in computing and provide the core silicon DNA for supercomputing, medical image processing, and many emergent applications, as worldwide connectivity and bandwidth continue to change the world we live in. This article explores the performance of application code on the Sony® PLAYSTATION 3®'s Cell Broadband Engine system running Yellow Dog Linux®. A simple program demonstrates how multithreaded applications that use the Synergistic Processing Elements to offload work can enjoy tremendous speedup.

This article provides an overview of installing and using Yellow Dog Linux on the Sony PLAYSTATION 3 (PS3) to explore the capabilities of the Cell/B.E. processor. The PS3 provides an amazing low-cost platform for multithreaded data- and compute-intensive applications. It's very accessible and easy to program, and fun, too.

In this article, I'll use the Cell/B.E. SDK to build a benchmarking application using POSIX threads that are mapped onto the processor's Synergistic Processing Elements (SPEs) over the Element Interconnect Bus (EIB). (See the Resources section below for a link to the SDK.) The Cell/B.E. processor is an advanced SoC design that provides 205 GFLOPS of performance running at 3.2GHz with a simultaneous multithreaded (SMT) Power Processing Element (PPE) and up to eight SPEs that can be used for offloading work. On the PS3, one SPE is dedicated to the built-in Sony GameOS, also known as the hypervisor, and one SPE is disabled to increase PS3 yield and lower cost. So, on the Yellow Dog Linux PS3 platform used in this article, the PPE has six SPEs to which it can offload over the EIB.

I'll walk you through some Pthreads-based code that you can download and use to compare Cell/B.E. processor performance to most other multicore or simultaneous multithreaded architectures. I think you'll find that, compared to most other options, the Cell/B.E. processor provides truly amazing offload performance at low cost and power -- especially on the PS3.

Finally, I'll discuss the capabilities that the Cell/B.E. processor brings to systems and its potential for use in embedded or large, scalable clustered systems. The amazing capability of the Cell/B.E. processor will undoubtedly revolutionize many emergent applications in broadband, graphics, and high-performance computing (HPC), and is leading the way in SoCs joining mainstream computing.

A few notes on getting your PS3 system going

It is easy to get Yellow Dog Linux running on the PS3; however, I would like to point you to the resources I used and note a few insights that helped make my install painless and fun. Also, please note that the developerWorks Power Architecture technology zone includes many great PS3 Linux articles that helped me as well. (See the Resources section for links.)

  • Yellow Dog Linux (YDL): This Linux (2.6.16 kernel) distribution for Power Architecture™ systems from Terra Soft Solutions makes running Linux on PS3 simple and includes a great guide for installation on the PS3. (See Resources.)

  • An HDMI-to-DVI-D or component cable for HDTV: YDL and the Enlightenment X Window manager are best enjoyed at 1080p (1080 progressive scan resolution) on an HDTV. If you have an HD monitor or TV, I highly recommend getting the right HDMI, component, or HDMI-to-DVI cable so you can work in a windowed environment. (See Resources.)

  • Install from DVD and use a thumb drive: The bootloader installer and bootloader (kboot) are most easily installed from a thumb drive that you can plug into one of the many USB 2.0 ports on the PS3. Likewise, YDL can be purchased or downloaded and burned onto a single-layer DVD for simple Linux install. I found the thumb drive to be a great way to transfer code as well, because the wireless 802.11g interface is not yet available in YDL and I don't have Gigabit Ethernet down by my HDTV (yet). I was able to simply pop in the thumb drive in YDL and mount it with the following command:

    mount -t msdos /dev/sdf1 /mnt/thumb
    

  • Don't worry about GameOS: When I was installing Linux, it was clear that the PS3 was designed to accommodate the installation of a second OS, and offers protection mechanisms to prevent installation glitches from harming GameOS. GameOS is really a hypervisor that also manages the second OS boot and install and provides fail-safe options along with kboot to restore your PS3 back to its shipped configuration. The 60GB PS3 provides up to 10GB for GameOS and almost 50GB for Linux (or vice versa).





Jumping into SPE offloading enthusiastically

Yellow Dog Linux runs on many Power Architecture platforms

YDL runs on most Power Architecture platforms, including Apple G3, G4, and G5 machines; the Sony PS3; and embedded and HPC systems from IBM and Mercury. For example, Mercury has a 1RU dual Cell/B.E. system with dual Gigabit Ethernet and PCI-e expansion slots. At the time this article was written, YDL includes Version 2.6.16 of the Linux kernel, which corresponds to Fedora Core 5 and SUSE 10.0; it will undoubtedly be upgraded soon to the 2.6.18 kernel to match the recently released Fedora Core 6 and SUSE 10.2. For those interested in the embedded and HPC aspects of Cell/B.E. systems, this means that you can easily develop code at home on a low-cost, easy-to-use Linux platform that is instruction-set-architecture compatible with many of the new and exciting Cell/B.E. HPC and embedded systems.

The whole reason I was interested in getting a PS3 to run Linux was to see just how well the SPE offload in a Cell/B.E. system worked. Okay, I wanted to justify HD-quality Madden Football, too, but I'll guess that the SPE offload is what you're most interested in. Playing a game or two on the PS3, it's clear that there's some real compute power under the hood, but writing your own code is truly believing.

The C code found in Listing 1 includes three basic benchmarks, using iterations of a 64-bit Fibonacci sequence:

  • SPE offloading of threads: One to six threads are created at a time so that work is offloaded to all available SPEs by the testThread() function. Upon completion of a set of six threads, the next set of SPE offload threads is started. This continues until all threads have been executed on the SPEs. The Linux gettimeofday() function is called to time this segment of code, including completion time for all SPE threads started.

  • POSIX Pthread threads: This segment follows the SPE thread test and creates a POSIX thread for each Fibonacci series calculation working in a loop until all are created and active, and then waits in a loop for all to finish. Again, gettimeofday() is called to time the segment.

  • Sequential iteration: The Fibonacci sequence code is called iteratively multiple times to match the number of workers that were executed in the SPE and Pthread tests above.

Find the full source for this code in the Download section of this article. The code and makefile found in src1.zip are the first version of the code, written using POSIX threads only; src2.zip contains a revised version of the code that includes SPE offload. The exact code run, including the makefile, is provided. To build the PS3 SPE offload version, use make spetest; to build the simple Pthreads version, use make. The Pthreads and sequential tests will run on any Linux platform, but the SPE test will only run on a PS3 Linux installation.
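Each benchmark segment is timed with gettimeofday(). The helpers that Listing 1 calls, readTOD() and elapsedTOD(), ship with the download and are not shown in the article, so the following is only a minimal sketch of how such helpers could be written. The names read_time_seconds(), elapsed_seconds(), and time_segment() are hypothetical stand-ins, not the author's functions.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Current wall-clock time in seconds, read with gettimeofday().
 * Hypothetical stand-in for the article's readTOD(). */
double read_time_seconds(void)
{
    struct timeval tv;
    int rc = gettimeofday(&tv, NULL);

    assert(rc == 0);
    (void)rc;  /* keeps builds with NDEBUG free of unused-variable warnings */
    return (double)tv.tv_sec + (double)tv.tv_usec / 1.0e6;
}

/* Elapsed seconds between two readings.
 * Hypothetical stand-in for the article's elapsedTOD(stop, start). */
double elapsed_seconds(double stop, double start)
{
    return stop - start;
}

/* Time one call of work(arg), including everything it waits for. */
double time_segment(void (*work)(void *), void *arg)
{
    double start = read_time_seconds();

    work(arg);
    return elapsed_seconds(read_time_seconds(), start);
}
```

Because gettimeofday() reports wall-clock time, a segment timed this way includes thread creation and join time, which is what the three benchmarks are meant to compare.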

A note on Pthreads scheduling and compiler optimization

testThread() itself is a Pthread that runs with the first-in, first-out (FIFO) policy at rtmax-1 so that it won't be interfered with during the testing. All Pthreads created for the test run with the FIFO policy at rtmax priority so that they won't be interfered with at all other than by critical system interrupts. This careful Pthread scheduling leads to much more deterministic and repeatable test runs. Furthermore, the workload code implementing the Fibonacci sequence has been carefully designed so that it cannot be over-optimized and will yield a repeatable workload for each threading method tested.
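The scheduling setup described above can be sketched with standard POSIX calls. Listing 1 passes an attribute object named rt_sched_attr to pthread_create(), but its initialization is not shown in the article, so the helper below (init_fifo_attr(), a hypothetical name) is an assumption about how such an attribute might be prepared, not the author's code.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Prepare attr so that a thread created with it runs SCHED_FIFO at
 * (maximum FIFO priority - below_max). Pass 0 for rtmax, 1 for rtmax-1.
 * Returns 0 on success or an errno-style error code.
 * Hypothetical helper; the article's own setup code is in the download. */
int init_fifo_attr(pthread_attr_t *attr, int below_max)
{
    struct sched_param param;
    int rc;

    assert(attr != NULL && below_max >= 0);

    if ((rc = pthread_attr_init(attr)) != 0)
        return rc;
    /* Without explicit scheduling, the new thread inherits the creator's
     * policy and the settings below are ignored. */
    if ((rc = pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED)) != 0)
        return rc;
    /* Set the policy before the priority so the priority is checked
     * against the FIFO range. */
    if ((rc = pthread_attr_setschedpolicy(attr, SCHED_FIFO)) != 0)
        return rc;

    param.sched_priority = sched_get_priority_max(SCHED_FIFO) - below_max;
    return pthread_attr_setschedparam(attr, &param);
}
```

On Linux, creating a thread with a real-time policy typically requires elevated privileges, so pthread_create() can fail with EPERM when the test is not run as root.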


Listing 1. Pthread used to evaluate sequential, PPE threaded, and SPE threaded performance
                

void *testThread(void *threadid)
{
#ifdef SPE
   fibdata fdarray[MAX_NUM_THREADS] __attribute__((aligned(16)));
   speid_t spe_id[6];
#endif
   double DT=0.0;
   int i, j;
   int numThreads = (int)threadid;
   int threadAlloc;

#ifdef SPE

   for(i=0;i<MAX_NUM_THREADS;i++)
   {
       // Initialize data to send to each SPE
       fdarray[i].idx = 0;
       fdarray[i].jdx = 1;
       fdarray[i].seqCnt = seqIterations;
       fdarray[i].iterCnt = Iterations;
       fdarray[i].fib = 0;
       fdarray[i].fib0 = 0;
       fdarray[i].fib1 = 1;
       fdarray[i].padding = 0;
   }

   // SPE thread benchmark
   startTOD=readTOD();
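   for(i=0;i<numThreads;i+=6)
   {
       // Start at most six SPE threads per batch (one per available SPE)
       if((numThreads-i) >= 6)
            threadAlloc=6;
       else
            threadAlloc=(numThreads-i);

       for(j=0;j<threadAlloc;j++)
       {
           // Each SPE thread gets its own fibdata element
           spe_id[j] = spe_create_thread(0, &fib_spe_handle,
                                         &fdarray[i+j], NULL, ANY_SPE,
                                         NO_OPTIONS);
           if(spe_id[j] == 0) exit(-1);
       }
       // Wait for the whole batch before starting the next one
       for(j=0;j<threadAlloc;j++)
           spe_wait(spe_id[j], NULL, 0);

   }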
   stopTOD=readTOD();

   DT = elapsedTOD(stopTOD, startTOD);

   printf("SPE Thread: Fib(%u)=%llu in %lf secs for Iter=%u\n",
          (seqIterations*Iterations), fdarray[0].fib, DT, fdarray[0].iterCnt);

#endif


   // Pthread benchmark
   startTOD=readTOD();
   for(i=0;i<numThreads;i++)
       pthread_create(&threads[i], &rt_sched_attr, fibSeq, (void *)i);
   for(i=0;i<numThreads;i++)
       pthread_join(threads[i], NULL);
   stopTOD=readTOD();

   DT = elapsedTOD(stopTOD, startTOD);

   printf("Threaded:   Fib(%u)=%llu in %lf secs for Iter=%u\n",
          (seqIterations*Iterations), finalFibVal[0], DT, Iterations);

   // Sequential benchmark
   startTOD=readTOD();
   for(i=0;i<numThreads;i++)
       fibSeq((void *)i);
   stopTOD=readTOD();

   DT = elapsedTOD(stopTOD, startTOD);

   printf("Sequential: Fib(%u)=%llu in %lf secs for Iter=%u\n",
          (seqIterations*Iterations), finalFibVal[0], DT, Iterations);

   return NULL;
}
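The batching used by the SPE test (no more than six offload threads in flight, one per SPE available to Linux on the PS3) can be checked in isolation on any machine. The following is a small sketch of that arithmetic only; next_batch_size(), count_batches(), and MAX_SPE_BATCH are hypothetical names introduced for illustration.

```c
#include <assert.h>

#define MAX_SPE_BATCH 6  /* SPEs usable from Linux on the PS3, per the article */

/* Number of offload threads to start in the batch that begins after
 * `started` threads have already run, out of `total` requested. */
int next_batch_size(int total, int started)
{
    int remaining = total - started;

    assert(total >= 0 && started >= 0);
    if (remaining <= 0)
        return 0;
    return remaining < MAX_SPE_BATCH ? remaining : MAX_SPE_BATCH;
}

/* Walk every batch the way the benchmark loop does and count them. */
int count_batches(int total)
{
    int started = 0;
    int batches = 0;

    while (started < total) {
        started += next_batch_size(total, started);
        batches++;
    }
    return batches;
}
```

With 14 requested threads, for example, this yields batches of six, six, and two, so the array of SPE thread IDs never needs more than six entries.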

The test code used to benchmark the speedup provided by the six SPEs available on the PS3 uses the Cell SDK to embed code and to pass data to the SPEs through the EIB. The fibdata data structure is passed down to an SPE by the PPE along with the code that is embedded through the fib_spe_handle. It has been well noted by the designers of the Cell/B.E. architecture that some atypical coding constructs must be used to employ the SPEs -- the method used to embed and download code along with data used in this example is the main coding paradigm that differs from typical threading. The makefile uses spu-gcc, an SDK compiler, to build the SPE downloadable code, which is provided in Listing 2. Furthermore, once the SPE code has been generated, it is embedded into an ELF (Executable and Linking Format) object code file and incorporated into the main program through the fib_spe_handle.

Following this procedure is simple, but, on the downside, the parameters passed in with the code must be 16-byte aligned; they are also not type checked or otherwise checked to ensure consistency between the two fibdata declarations in Listing 1 and Listing 2. Programmers should take care to make sure agreement between the types, alignment, and structure definition is correct, because any disagreement won't be caught by the compiler and will lead to a runtime error that will be harder to debug. The worst that can happen is a bus error or parameter mismatch, but the cause of such a problem may not be immediately obvious, so double-check declarations shared between PPE and SPE.
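Since the compiler cannot compare the PPE and SPE declarations, one way to catch some mismatches early is to add compile-time checks next to the structure. The sketch below copies the fibdata layout from Listing 2 and asserts the properties the DMA transfer relies on. It assumes a C11 compiler for _Static_assert, and the is_dma_aligned() helper is a hypothetical addition, not part of the SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as the fibdata declaration in Listing 2. */
typedef struct
{
    unsigned int idx;
    unsigned int jdx;
    unsigned int iterCnt;
    unsigned int seqCnt;
    unsigned long long fib;
    unsigned long long fib0;
    unsigned long long fib1;
    unsigned long long padding;
} fibdata;

/* The transfer size should be a multiple of 16 bytes; the padding
 * field is what rounds the structure up from 40 to 48 bytes. */
_Static_assert(sizeof(fibdata) % 16 == 0,
               "fibdata size must be a multiple of 16 bytes");

/* Both sides must agree on where the 64-bit fields begin. */
_Static_assert(offsetof(fibdata, fib) == 16,
               "unexpected offset for fibdata.fib");

/* Returns 1 if p is 16-byte aligned, 0 otherwise (hypothetical helper). */
int is_dma_aligned(const void *p)
{
    return ((uintptr_t)p & 0xFu) == 0;
}
```

Placing the structure and these checks in a single header included by both the PPE and SPE builds would remove the duplicate declaration altogether; that is a suggestion, not something the downloadable code is known to do.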


Listing 2. The Fibonacci workload program for the SPEs
                
#include "spu_mfcio.h"

typedef struct
{
    unsigned int idx;
    unsigned int jdx;
    unsigned int iterCnt;
    unsigned int seqCnt;
    unsigned long long fib;
    unsigned long long fib0;
    unsigned long long fib1;
    unsigned long long padding;
} fibdata;

int main(unsigned long long spe_id, unsigned long long fibdata_ea, unsigned long long env)
{
   fibdata fd __attribute__((aligned(16)));
   int tag_id = 0;

   // Read in fibdata
   mfc_get(&fd, fibdata_ea, sizeof(fibdata), tag_id, 0, 0);
   // wait for data
   mfc_write_tag_mask(1<<tag_id);
   mfc_read_tag_status_any();

   // Compute sequence requested
   for((fd.idx)=0; (fd.idx) < (fd.iterCnt); (fd.idx)++)
   {
      fd.fib = fd.fib0 + fd.fib1;
      while((fd.jdx) < (fd.seqCnt))
      {
         fd.fib0 = fd.fib1;
         fd.fib1 = fd.fib;
         fd.fib = fd.fib0 + fd.fib1;
         (fd.jdx)++;
      }
   }

   // Write back fibdata
   mfc_put(&fd, fibdata_ea, sizeof(fibdata), tag_id, 0, 0);
   // wait to complete
   mfc_write_tag_mask(1<<tag_id);
   mfc_read_tag_status_any();

   return 0;

}


The code in Listing 2 is downloaded to each SPE. The SPE code reads fibdata from main memory into its local store with the SDK mfc_get() call and writes the result back with mfc_put(). Note that the SPE code is written as a new main program and code is generated for it prior to embedding through the handle into the PPE code. In this example, there is one version of the SPE code used by all SPEs, but a unique copy of fibdata is used for each thread instance with fdarray.
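The DMA calls only work on a Cell/B.E. SPE, but the compute loop between them is plain C and can be run on any host to produce a reference value for comparison with what an SPE writes back. The sketch below mirrors the loop in Listing 2 exactly, including the fact that jdx is not reset between outer iterations. fib_init() and fib_workload() are hypothetical names, and the initial values follow the initialization shown in Listing 1.

```c
#include <assert.h>
#include <stddef.h>

/* Same field order as the fibdata declaration in Listing 2. */
typedef struct
{
    unsigned int idx;
    unsigned int jdx;
    unsigned int iterCnt;
    unsigned int seqCnt;
    unsigned long long fib;
    unsigned long long fib0;
    unsigned long long fib1;
    unsigned long long padding;
} fibdata;

/* Initialize one work item the way Listing 1 does on the PPE. */
fibdata fib_init(unsigned int seqCnt, unsigned int iterCnt)
{
    fibdata fd = {0};

    fd.jdx = 1;
    fd.seqCnt = seqCnt;
    fd.iterCnt = iterCnt;
    fd.fib1 = 1;
    return fd;
}

/* Host-side copy of the compute loop from Listing 2 (no DMA). */
void fib_workload(fibdata *fd)
{
    assert(fd != NULL);
    for (fd->idx = 0; fd->idx < fd->iterCnt; fd->idx++) {
        fd->fib = fd->fib0 + fd->fib1;
        while (fd->jdx < fd->seqCnt) {
            fd->fib0 = fd->fib1;
            fd->fib1 = fd->fib;
            fd->fib = fd->fib0 + fd->fib1;
            fd->jdx++;
        }
    }
}
```

Running fib_workload() on a fibdata initialized with the same seqCnt and iterCnt as the SPE run gives a value to check against fdarray[0].fib after the SPE threads finish.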

I first completed testing using the Pthreaded code just to take a look at the speedup provided by the PPE simultaneous multithreading (SMT) to the Fibonacci worker threads. If you download and run this code, the acceleration provided by the PPE SMT becomes apparent when the Fibonacci sequence is run for thousands of iterations or more. Basically, there is a point at which the overhead of thread creation and management is overcome by the speedup gained from threads being executed with SMT acceleration.

I further tested this simple Pthread code on the PPE to see how it scales with an increasing number of threads. In general, the PPE SMT provides a constant speedup that is significant. Looking at PPE SMT-based speedup as a ratio of sequential time taken divided by Pthread time taken for each thread set will show you how speedup varies with number of threads. Speedup is fairly constant as threads are scaled. The PPE itself provides significant thread scaling, but it is intended to provide control and workload management for the SPEs, which provide much greater speedup for threads.
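The speedup ratio described above (sequential time divided by threaded time) is simple to compute from the times the benchmark prints. The following sketch adds the usual parallel-efficiency figure, which divides speedup by the number of workers. Both function names are hypothetical, and the article reports no specific timings, so any values passed in are the reader's own measurements.

```c
#include <assert.h>

/* Speedup as defined in the text: sequential time / threaded time. */
double speedup(double sequential_secs, double threaded_secs)
{
    assert(threaded_secs > 0.0);
    return sequential_secs / threaded_secs;
}

/* Parallel efficiency: speedup divided by the number of workers.
 * A value of 1.0 means perfectly linear scaling. */
double efficiency(double sequential_secs, double threaded_secs, int workers)
{
    assert(workers > 0);
    return speedup(sequential_secs, threaded_secs) / (double)workers;
}
```

For the SPE test on the PS3, workers would be six, so an efficiency above 1.0 corresponds to the better-than-sixfold case.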

You can best achieve huge performance advantages on Cell/B.E. systems by downloading thread code onto SPEs. You might expect that six SPEs would provide a speed improvement of about a factor of six (minus the overhead of code download and message passing), but I was pleasantly surprised to find an even greater speedup on my system. I suggest that you download the code and give it a try on the PS3 or any other Cell/B.E. system to measure speedup. I only tested the code on YDL on the PS3, but I would expect it to work on just about any Cell/B.E. platform that runs Linux. POSIX threads have been designed to be portable and the SPE offload uses the Cell/B.E. SDK.

When the SMT of the PPE and pipelining across the SPEs are both employed, I found that the speedup was greater than I thought it might be on my system. The EIB allows the PPE to start downloads on multiple SPEs and to overlap their starts and stops very efficiently so that the entire process is fully pipelined. Acceleration is also provided on the SPEs by their 128-bit-wide SIMD instruction set and 128-entry vector register file. So, not only can the PPE efficiently start overlapping execution on all SPEs, but the code would be expected to execute faster thanks to vector processing as well. The test code provided in this article will reveal the true power of Cell/B.E. processing for threaded applications, and to some extent, the vector processing capabilities of the SPEs. The PPE also includes AltiVec (VMX) instructions that accelerate graphics, image processing, and digital signal processing, which will further improve performance for applications that can employ them. The code provided in this article is limited to integer operations. In future articles, I plan to take a closer look at floating point workloads.





The original and true promise of broadband to our planet

The core that won back supercomputing honors

In September of 2004, the Japanese NEC Earth Simulator system was dethroned by the U.S. Blue Gene®/L system as the world's fastest supercomputer. Power Architecture technology (the PowerPC cores in Blue Gene/L are relatives of the PPE in Cell/B.E. chips) helped bring this honor back to the U.S. with the rollout and power-on of Blue Gene/L. Blue Gene/L once again took honors on the SC06 Top 500 list with 280.6 TFLOPS.

For me, using the PS3 to run Linux brings back memories of the early days of computing, when the world was filled with the promise of silicon-based revolution. Maybe it's the programming on a TV that reminds me of the early computers. Better yet, Cell/B.E. technology has been designed to help realize the full potential of Web-based networked computing with high-end graphics and video. The World Wide Web in its early days was seen as revolutionary in that it would surely lead to work at home, less travel, less commuting, less pollution, more global communication, and a flat world with fair e-commerce, and would ultimately serve to provide more efficiency and fairness in the world as a whole -- well, at least a few idealists like myself thought this. Cell/B.E. technology is exciting because the PS3 may be the lowest-cost, highest-performance computer ever provided to the general public. Sure, the PS3 is a costly game platform, but it can do far more than play games. The idea of Cell/B.E. chips as the DNA of the fastest computer available to the masses, built on the same Power Architecture lineage as the fastest computer period (Blue Gene/L), is truly promising. With broadband transport becoming widely available to most of the world, a multicore processor to make good use of it has now finally also been introduced.

As the world faces issues like global warming and political rifts, the emergence of new technology that can help us work more effectively at home, minimize grueling business travel, communicate better, have more fun, and get excited about computing again is a welcome sight. I have to wonder: Do current estimates of potential reductions in greenhouse gases take into account the extent to which broadband might reduce commuting and global travel for business? Emergent new applications like telemedicine, effective high-definition video conferencing, and true virtual presence can help change the shape of things to come for the better. Either way, Cell/B.E. technology sure does help my threads run faster.





Conclusion

The PS3 is a great and relatively low-cost way to explore and evaluate the capability of Cell/B.E. technology -- and it's fun, too. It can also serve as a development platform for work on Cell-based HPC or embedded software and has the ability to serve as a great Linux platform at home. While it is short two SPEs compared to HPC/embedded Cell/B.E. platforms, it can host the same SDK and be used quite readily to develop SPE offloading code. It will be interesting to see how many PS3s wind up running Linux -- I suspect all will also be used as the game platforms they were intended to be, but the idea of a game system designed to host Linux from the beginning was in my opinion an excellent decision for both PS3 and Cell/B.E. technology. While Cell/B.E. chips and the PS3 may not solve global warming, they will keep a few people off the road and at home nights and weekends.






Downloads

Description                                         Name      Size  Download method
Sample code written using POSIX threads only        src1.zip  32KB  HTTP
Revised version of code that includes SPE offload   src2.zip  36KB  HTTP


Resources

Learn

Get products and technologies


About the author

Dr. Sam Siewert is an embedded system design and firmware engineer who has worked in the aerospace, telecommunications, and storage industries. He also teaches at the University of Colorado at Boulder part-time in the Embedded Systems Certification Program, which he co-founded. His research interests include autonomic computing, firmware/hardware co-design, microprocessor/SoC architecture, and embedded real-time systems.




Original link: http://www-128.ibm.com/developerworks/power/library/pa-soc12/?S_TACT=105AGX54&S_CMP=NLLX