
Experiments in Scientific Computation on the PlayStation 3

Sevre, Erik O.D.; Christiansen, Monica D.; Broten, Matt; Wang, Shuo M.;

Yuen, David A.

Dept. of Geology and Geophysics and Minnesota Supercomputing Institute,

University of Minnesota, Minneapolis, MN 55455, U.S.A.

Submitted to Visual Geosciences, May, 2008.

Abstract

The Sony PlayStation 3 (PS3) offers the computational power of a parallel processor at low cost, which

makes it a great starter unit for development in parallel programming. To explore the capabilities of the unit,

we took a simple ray tracing program and extended it to render triangulated height field data across the PS3’s

6 SPUs (synergistic processing units). We also implemented the heat averaging equation as a precursor to

CFD analysis on the PS3. In our studies, we found the Cell engine in the PS3 to be a powerful machine; however, great care must be taken while developing because its unique platform calls for many levels of optimization to ensure efficiency.

1. Introduction

In the past two years, there has been a lot of discussion about the new Cell Broadband Engine (Cell BE) microprocessor. The simple reason for this is that parallel architectures such as the Cell are widely viewed as the future of scientific computation, because they enable a higher volume of calculations to be performed.

However, due to the cost of building a Cell computer, simply constructing a complete system for use as a test

bed is impractical. The Sony PlayStation 3 (PS3), on the other hand, provides an ideal environment to start

developing with its native support for the Yellow Dog Linux operating system. Costing only $600, the PS3

Cell provides 218 Gflops of processing power that can be utilized with a relatively quick and simple

installation.

We wanted to see if the Cell processor lived up to the advertised capabilities. However, we needed to

consider the limitations of the PS3 Cell: it has very limited memory, it has no access to the GPU, and it

performs better using single precision floating point operations. In this paper, we will discuss the

applications we have written for the PS3 Cell processor, the many challenges we found when programming,

and the solutions we came up with to overcome these challenges.

Many of the programs we examine on the PS3 could also be implemented on a Graphics Processing Unit (GPU), but we have chosen to focus our attention on investigating how the Cell BE works as a computation engine. The Cell BE is midway between a conventional CPU and a GPU. Each processor has advantages and disadvantages based on how memory and computations are managed. GPUs excel at graphics, CPUs are good for general computing, and the Cell BE was created for core scientific computing. Instead of comparing the PS3 to a GPU, we look at how we developed and adapted code for the PS3, and at the problems and advantages we saw with the Cell BE.

2. Hardware description of the PS3

The PS3 features the novel multi-core Cell BE processor. The Cell BE processor is a heterogeneous chip

comprised of nine cores. One core is a general purpose 64-bit 3.2 GHz PowerPC processor referred to as the

Power Processing Element (PPE). The other eight cores are 3.2 GHz single instruction multiple data (SIMD)

processors referred to as Synergistic Processing Elements (SPEs).

Figure 1. A diagram of the Cell architecture in the PlayStation 3.

Each SPE is comprised of a Synergistic Processing Unit (SPU), a dedicated memory flow controller (MFC),

and a 256 KB local store (LS). The LS is high-speed SRAM used to store the SPE's program code and data.

Since an SPE cannot directly access main memory, all read and write operations issued by an SPE are

limited to its LS. Data transfer operations between an SPE's LS and main memory are handled by the MFC.

The MFC allows for data transfer operations to be carried out while the SPE performs other tasks. This

provides the opportunity for data transfer latencies to be hidden using software controlled multi-buffering.

Although the Cell BE processor includes eight SPEs, only six are available for use by applications running

on the PS3. One SPE is disabled to make manufacturing the Cell BE more cost effective. Another SPE is

dedicated to performing support tasks for the virtualization layer (or hypervisor) known as Game OS. Each

SPE delivers a peak of 25.6 Gflop/s single precision performance at 3.2 GHz, so the six SPEs are capable of

a maximum of 6 x 25.6 Gflop/s = 153.6 Gflop/s.

The PPE consists of the simultaneous multi-threading Power Processing Unit (PPU), a 32 KB

instruction/data L1 cache, and a 512 KB L2 cache. The PPE uses the PowerPC 970 instruction set, so it is

capable of running existing applications compiled for the PowerPC architecture. The PPE can also directly


address hardware resources and has full access to main memory. Thus, the PPE is ideal for running an

operating system. Although the PPE is a capable processor for computational tasks, its main responsibility is

the distribution and management of tasks for the SPEs.

The Cell BE processor is connected to main memory via the memory interface controller (MIC). In the PS3,

the MIC provides a 25.6 GB/s connection to 256 MB of dual-channel Rambus Extreme Data Rate (XDR)

memory. Approximately 200 MB of this memory is available to applications running on the PS3.

The PPE, SPEs, MIC, and input/output device system are connected via the element interconnect bus

(EIB). The EIB is a four ring data communication structure which supports a total bandwidth of 204.8 GB/s.

Each of the components has a 25.6 GB/s connection to the EIB.

Besides the Cell BE processor and the previously mentioned 256 MB main memory, the PS3 includes a 256

MB nVidia RSX graphics processing unit (GPU), a 60 GB hard drive, a Gigabit Ethernet (GigE) NIC, and a Blu-ray disc drive. The GPU could be used for computation in concert with the Cell BE processor, but at

present its use is blocked by the hypervisor.

3. Ray tracing

Ray tracing is a visualization technique that uses an eye, a viewing port, an object, and a light source. Part of the reason we chose to build a ray tracing application is that it fits well with the capabilities of the PS3: ray tracing does not require a high level of precision, but it does require many computations in a short amount of time. For each pixel in the port, ray tracing determines whether rays from the light source hit the object. Each hit-or-miss calculation is therefore independent, so each SPE can process a line of pixels until the reflections and lighting have been calculated for the entire screen. The resulting image is then rendered to the screen.

Figure 2. An illustration of ray tracing

Building a fully featured ray tracer is a very involved project, and we intentionally chose not to do this. This project focused on how the hardware could accelerate the ray tracing process, so we chose to optimize the hardware implementation rather than the algorithm. By optimizing only for the hardware, we can more closely monitor how much gain comes from the Cell hardware itself, without wondering whether speed gains came from the algorithm or the hardware.

We started the project using C code for a simple ray tracing application, developed by Eric Rollins. In his

application he only looked at intersections of circles and rays of light. To expand on this we first started

looking at intersections of triangles and rays. Once we were able to render triangles, we went on to modify height field data from tsunami simulations so that it could be visualized. The height field data was large enough that we had to expand our data handling, since only a limited amount of data can be stored in an LS.

4. Tsunami Visualization on the PS3

4.1 PPU Program

The following offers a description of the PPU side program of the ray tracing application. First, float values

are read from a file containing triangulated data to populate an array of triangle structures in main memory.

Each triangle structure represents a single triangle and contains three float vectors. Each float vector

represents a point of the triangle and contains three float values for the x, y, and z coordinates of the point.

Since each float vector is 16 bytes, each triangle structure is 48 bytes. After the array of triangle structures is

filled, a control block structure is initialized for each SPE containing two 32 bit addresses and two integer

values. One address gives the location in main memory of a two-dimensional array of long values, referred to

from now on as the pixel buffer array. This array is used to store pixel color values given as a result of the

ray tracing algorithm. The other address gives the location in main memory to the previously described array

of triangle structures. One of the integers gives a value, referred to here as the line offset value, which

dictates how the workload is split between the SPEs. The second integer is a filler value used to simply

expand the size of the structure to 16 bytes so that it is the correct size for a DMA transfer. After the control

block structures are setup, a thread is created for each SPE using the spe_create_thread function from the

IBM SDK libspe library. The address of the SPE's associated control block structure is passed as an

argument to the spe_create_thread function. On creation of an SPE thread, the respective SPU side program

is started. A description of the workings of the general SPU side program is given later. Finally, after

receiving a mailbox message from the SPEs notifying that all computations were performed, the pixel buffer

array is rendered to an image using Simple DirectMedia Layer (SDL).

4.2 SPU Program

The following describes the working of the SPU side program of the ray tracing application. First, the control

block structure is transferred from main memory to the LS. After this, the LS is filled with 3696 triangle

structures using an efficient transfer mechanism. This is accomplished through 11 DMA operations where

each time 336 triangle structures are copied from main memory to the LS. A DMA operation in which 336

triangle structures are transferred is optimal. This is the case because 336 triangle structures amounts to 336 x 48 bytes = 16128 bytes; no greater whole number of triangle structures can be transferred such that the size of the transfer is under 16 KB (the maximum amount of data that can be transferred) and a multiple of 128 bytes (necessary for an optimal transfer time). A total of 11 of these transfers fills the LS. After the LS is

filled with triangle structures, the ray tracing algorithm is executed on those triangle structures, and the pixel

buffer array is updated as necessary. The previous two steps, the transfer of triangle structures and the

execution of the ray tracing algorithm on the triangle structures, is repeated as needed until all of the triangle

structures have been processed. The line-offset value specified in the control block structure dictates which

lines in the final image the SPE is responsible for calculating. This ensures that each SPE is responsible for

doing 1/6 of the total ray tracing calculations. Lastly, the updated pixel buffer array is transferred to main

memory, and a mailbox message is written to the PPE signaling that computations are complete.

Figure 3. A block diagram of the ray tracing program run on the Cell architecture

5. Parallel Layout

The process of developing applications for the Cell BE Processor is more involved than programming for

other common architectures. The design of the Cell BE Processor necessitates an application design

consisting of separate programs for the general purpose PPE and the SIMD SPEs. The program running on

the PPE is comprised of control code for the application. For example, the PPE program initiates the SPE

threads and keeps the SPE programs synchronized. Synchronization and other communication between the

PPE and SPE programs is maintained through a mailbox messaging system. The SPEs are vector processors,

so special care must be taken to vectorize the programs that run on the SPEs. Libraries developed

specifically for the Cell Processor handle vector operations. The SPE programs must explicitly handle the

transfer of data between the SPEs and main memory. Each SPE has a 256 KB LS used to hold the program

and its data. The PPE program has direct access to main memory, but the SPE programs must use DMA

operations to transfer data back and forth from main memory. The limited memory capacity of the SPE's LS

requires the implementation of mechanisms to repeatedly fetch data from main memory during computation.

The DMA operations are handled independently from the SPE core by the MFC. The MFCs allow for data

transfer times to be hidden in situations where computations associated with a chunk of data are more time

consuming than the time taken to transfer the data. Hiding data transfer latencies is accomplished through

explicitly implemented buffering techniques. Such techniques range from simple double buffering to

software-hyperthreading. For the previously mentioned reasons, the implementation of applications for the

Cell Processor clearly requires attention to details beyond those needed in the development of software

systems for other common architectures.

The software development kit provided by IBM, available at http://www.ibm.com/developerworks/power/cell/, facilitated programming for the PlayStation and proved to be quite helpful. The IBM SDK included a

compiler designed specifically for programs written for the cell, as well as an array of functions which

simplified programming. One of the data types available in cell programming is the vector float, which is an

array of floats of size four. The SDK allowed us to perform calculations on the entire vector float, instead of

forcing us to loop through the array and operate on each element individually. Other functionality that the SDK provides

includes SPE timings and data transfers. However, the SDK did not simplify PPE/SPE communications, and

any data passed between the two needed to be extracted using direct memory access.

Figure 4. A render of the triangulated tsunami data

Figure 5. Above: A 2D depiction of height field data rendered in MATLAB. Below: The same height field data rendered as a 3D image in the PS3 ray caster.

6. Heat Equation on the PS3

As an experiment in parallel programming, we chose a simple 2-D problem that we could run in parallel using the SPUs on the PS3's Cell processor. The heat equation, $c\nabla^2 u = \frac{\partial u}{\partial t}$, where $u$ is the temperature field and $c$ is the thermal diffusivity, turns out to be quite simple in the steady state case, where $\frac{\partial u}{\partial t} = 0$. When we solve this equation for $u_{i,j}$, we find that we are averaging the surrounding nodes to find the value of the center node. In the figure below we solve the steady state heat equation by applying the recursive equation $x = \frac{a + b + c + d}{4}$ at each iteration.

Figure 6. Central differencing cross: Essentially we are taking the average of these points to compute

the next iteration.

The tricky part for parallel computation is the domain decomposition. We get around this by splitting the region into layers, one for each SPU, and sharing the boundaries between adjacent layers. The boundary needs to contain two nodes to allow the layers to communicate with each other. One of the difficulties we ran into was how to communicate between the nodes. Poor communication results in impaired performance, so we spent a lot of time looking at efficient ways to communicate between SPU nodes without using the PPU. To keep things efficient, we found it was best for the SPUs to directly access values from memory while the PPU simply coordinated when data was ready to transfer, instead of actually performing the data transfers.

Figure 7. Domain decomposition between SPUs

7. Set Up of the Driven Cavity Problem

7.1 Overview

Problems in computational fluid dynamics (CFD) may prove to be a better application for the PS3 Cell BE

processor than ray tracing. The nature of such problems, generally, is grid generation coupled with ODE and

PDE solutions through iteration and convergence methods. Thus the problem can be split up and each

processor assigned to a section of the grid. Because CFD focuses more on computation rather than

generating high quality visualization, these problems may be more suited for our experimentation on the

PS3. The driven cavity problem was chosen to test CFD on the PS3. With the particular parameters chosen,

computation on a standard desktop CPU can give a convergence time on the order of 30 minutes, which

leaves room for improvement on a parallel processor.

The driven cavity problem is well known within the field of computational fluid dynamics and provides a

good premise for testing methods as well as studying the effects of input parameters. The problem consists

of a rectangular cavity with an upper plate, which moves to the right at a velocity of uo. This moving plate

drags the fluid to the right and sets up a convection pattern in the cavity. The purpose of this section is to

explain this problem and how we would implement it on the PS3.

7.2 Theory

The governing theory behind the driven cavity problem is the vorticity-stream function formulation of two-dimensional flow. For two-dimensional flow, a function called the stream function, $\psi$, which satisfies the continuity equation, is defined by:

$$u = \frac{\partial \psi}{\partial y} \quad (1)$$

$$v = -\frac{\partial \psi}{\partial x} \quad (2)$$

The lines of constant ψ represent streamlines; the difference between streamline values gives the volumetric flow.

To find vorticity, the vorticity transport equation can be derived from the primitive formulation of the

incompressible Navier-Stokes equations and is shown in Equation 3.

$$\frac{\partial \Omega}{\partial t} + u \frac{\partial \Omega}{\partial x} + v \frac{\partial \Omega}{\partial y} = \nu \left( \frac{\partial^2 \Omega}{\partial x^2} + \frac{\partial^2 \Omega}{\partial y^2} \right) \quad (3)$$

Finally streamline values are calculated from the stream function equation, which is derived from the

definition of vorticity along with Equations 1 and 2. That is:

$$\frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2} = -\Omega \quad (4)$$

From these four equations, the behavior of the fluid in the cavity can be calculated by applying the correct

numerical algorithm.

7.4 Method

The cavity in this problem has dimensions of L by H, and the fluid inside the cavity has a kinematic viscosity of 0.0025 m2/s. The setup for the cavity is shown in Figure 8.

Figure 8. The domain of solution for the driven cavity problem: a cavity of width L (i = 1 to i = IM along x) and height H (j = 1 to j = JM along y), with the upper plate moving at velocity uo.

The numerical algorithm for solving the governing equations is to use the FTCS explicit scheme and the

point Gauss-Seidel formulation to solve for the vorticity and the stream function equations, respectively.

The iterative nature of the Gauss-Seidel formulation and the steady state requirements on the vorticity

equation require the use of convergence values for both formulations. A value of 0.001 is set for the

convergence of the stream function and a value of 0.002 is set for the convergence of the vorticity equation.

Table 1 provides a summary of the specific input parameters.

Grid sizes:

  IM = 41,  JM = 31,  Δx, Δy = 0.01 m
  IM = 81,  JM = 61,  Δx, Δy = 0.005 m
  IM = 161, JM = 121, Δx, Δy = 0.0025 m
  IM = 321, JM = 241, Δx, Δy = 0.00125 m

Dimensions and properties:

  L = 40 cm
  H = 30 cm, 10 cm
  uo = 5 m/s
  Δt = 0.001 s
  ν = 0.0025 m2/s, 0.00025 m2/s, 0.000025 m2/s

Table 1. An overview of parameters to be studied; note that some solutions may not be stable enough to converge.

The FTCS explicit scheme applies a finite difference formulation to Equation 3 to obtain

$$\Omega_{i,j}^{n+1} = \Omega_{i,j}^{n} + \Delta t \left[ \nu \left( \frac{\Omega_{i+1,j}^{n} - 2\Omega_{i,j}^{n} + \Omega_{i-1,j}^{n}}{(\Delta x)^2} + \frac{\Omega_{i,j+1}^{n} - 2\Omega_{i,j}^{n} + \Omega_{i,j-1}^{n}}{(\Delta y)^2} \right) - u_{i,j}^{n} \frac{\Omega_{i+1,j}^{n} - \Omega_{i-1,j}^{n}}{2\Delta x} - v_{i,j}^{n} \frac{\Omega_{i,j+1}^{n} - \Omega_{i,j-1}^{n}}{2\Delta y} \right] \quad (5)$$

from which the vorticity at the next time step can be found. The numerical scheme used to solve the stream function equation was the point Gauss-Seidel formulation, which yields

$$\psi_{i,j}^{k+1} = \frac{1}{2(1+\beta^2)} \left[ (\Delta x)^2 \, \Omega_{i,j}^{n+1} + \psi_{i+1,j}^{k} + \psi_{i-1,j}^{k+1} + \beta^2 \left( \psi_{i,j+1}^{k} + \psi_{i,j-1}^{k+1} \right) \right], \quad \text{where } \beta = \frac{\Delta x}{\Delta y} \quad (6)$$

These solutions are implemented by first solving for the vorticity within the domain, substituting the updated vorticity into Equation 6, and solving for the stream function. This procedure is repeated until convergence is met. One detail to note is that the boundary conditions for the vorticity must be updated as Ω is updated.

8. Future Prospects

The unique architecture of the PS3 offers a few optimization methods that could be pursued in the future.

The first method considered utilizes the built-in libraries available for the Cell processor that are capable of SIMD mathematical operations. Once implemented, SIMD instructions can operate on 4 sets of data at once.

This is done by vector operations where each entry in a vector of length four can be added, subtracted,

multiplied or divided by the corresponding entries in another vector of length four.

The second method for code optimization is buffering, or pipelining. This technique would prompt each SPE to load an initial set of data, then begin computation on that data while the next set is loaded in. Such an algorithm would eliminate much of the time wasted by an inactive SPU, as illustrated in Figure 9.

Figure 9. A depiction of double buffering

A third method for optimization is implementing MPI (Message Passing Interface) across more than one

PS3. MPI would enable one PPE to send instructions to more than one cell processor. This would increase

the number of SPEs available for use, thus decreasing computation time.

Along with MPI, research into rendering on a platform with a graphics card after completing computation on

the Cell is being considered. This will resolve any visualization issues.

Another prospect for future work is to switch concentration from the ray tracing algorithm to CFD

applications. This is practical because graphics cards can handle most of the issues arising from the ray

tracing algorithm and an algorithm that is capable of running a CFD problem would be of higher value.

Final Remarks

Programming on the PS3 has posed a very interesting and challenging problem to our group, and many ideas and much speculation have emerged from our study. Overall, the PS3 is a powerful machine, but much care must be taken when programming for the Cell BE processor. The architecture of the Cell only allows an SPU to work on small amounts of data at a time, which leads to a great amount of data being moved through memory; this eventually became a limiting factor in the algorithms used. This limiting factor requires many levels of optimization in order to bring computation time down to a minimum. Such unique characteristics bring about a tradeoff between time invested in programming and money invested in an architecture that is easier to optimize.

Acknowledgements

We acknowledge discussions with Prof. Paul Woodward and help from Spring Liu.

This research was supported by the Middleware grant of the IF program of the National Science Foundation.

References:

Bellens, P., Perez, J.M., Badia, R.M., Labarta, J., CellSs: a Programming Model for the Cell BE

Architecture, Proceedings of the ACM/IEEE SC 2006 Conference, 2006.

Benthin, C., Wald, I., Scherbaum, M., Friedrich, H., Ray Tracing on the Cell Processor, IEEE Symposium on

Interactive Ray Tracing, 15-23, 2006.

Buttari, A., Dongarra, J. and Kurzak, J., Limitations of the PlayStation 3 for High Performance Cluster

Computing, Tech. rep., Department of Computer Science, University of Tennessee, UT-CS-07-597, 2007.

Buttari, A., Luszczek, P., Kurzak, J., Dongarra, J., Bosilca, G., SCOP3: a rough guide to scientific computing

on the PlayStation 3, Tech. rep., Innovative Computing Laboratory, University of Tennessee Knoxville, UT-

CS-07-595, 2007.

IBM, Cell Broadband Engine Architecture, Version 1.01, 2006.

IBM, Synergistic Processor Unit Instruction Set Architecture, Version 1.2, 2007.

Kurzak, J., Buttari, A., Luszczek, P., Dongarra, J., The PlayStation 3 for High-Performance Scientific

Computing, Computing in Science & Engineering, 84-87, 2008.

William J. Dally, Patrick Hanrahan, Mattan Erez, Timothy J. Knight, François Labonté, Jung-Ho Ahn,

Nuwan Jayasena, Ujval J. Kapasi, Abhishek Das, Jayanth Gummaraju, Ian Buck, "Merrimac:

Supercomputing with Streams", SC2003, November 2003, Phoenix, Arizona.

Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands, P. and Yelick, K., The Potential of the Cell Processor for Scientific Computing, CF '06: Proceedings of the 3rd conference on Computing Frontiers, 9-20, 2006.

For insight on early work done on modern stream computing and GPU processing, look up the Merrimac project and Pat Hanrahan.

Sevre, Erik O.D.; Christiansen, Monica D.; Broten, Matt; Wang, Shuo M.;

Yuen, David A.

Dept. of Geology and Geophysics and Minnesota Supercomputing Institute,

University of Minnesota, Minneapolis, MN 55455, U.S.A.

Submitted to Visual Geosciences, May, 2008.

Abstract

The Sony PlayStation 3 (PS3) offers the computational power of a parallel processor at low cost, which

makes it a great starter unit for development in parallel programming. To explore the capabilities of the unit,

we took a simple ray tracing program and extended it to render triangulated height field data across the PS3’s

6 SPUs (synergistic processing units). We also implemented the heat averaging equation as a precursor to

CFD analysis on the PS3. In our studies, we found the Cell engine in the PS3 to be a powerful machine,

however great care must be taken while developing because its unique platform calls for many levels of

optimization to ensure efficiency.

1. Introduction

In the past two years, there has been a lot of discussion about the new Cell Broadband Engine (Cell BE)

microprocessor. The simple reason for this is that parallelized architectures, such as the Cell, are the future of

scientific computation. This is because they enable a higher volume of calculations to be performed.

However, due to the cost of building a Cell computer, simply constructing a complete system for use as a test

bed is impractical. The Sony PlayStation 3 (PS3), on the other hand, provides an ideal environment to start

developing with its native support for the Yellow Dog Linux operating system. Costing only $600, the PS3

Cell provides 218 Gflops of processing power that can be utilized with a relatively quick and simple

installation.

We wanted to see if the Cell processor lived up to the advertised capabilities. However, we needed to

consider the limitations of the PS3 Cell: it has very limited memory, it has no access to the GPU, and it

performs better using single precision floating point operations. In this paper, we will discuss the

applications we have written for the PS3 Cell processor, the many challenges we found when programming,

and the solutions we came up with to overcome these challenges.

Many of the programs we look at on the PS3 can be accomplished using a Graphics Processor Unit (GPU),

but we have chosen to focus our attention on investigating how the Cell BE works as a computation engine.

The Cell BE is midway between a conventional CPU and a GPU. Each processor has advantages and

disadvantages based on how memory and computations are managed. GPUs excel at graphics, CPUs are

good for general computing, and the Cell BE has been created for core scientific computing. Instead of

comparing the PS3 to a GPU we instead look at how we developed and adapted code for the PS3 and looking

at the problems and advantages we saw with the Cell BE.

2. Hardware description of the PS3

The PS3 features the novel multi-core Cell BE processor. The Cell BE processor is a heterogeneous chip

comprised of nine cores. One core is a general purpose 64-bit 3.2 GHz PowerPC processor referred to as the

Power Processing Element (PPE). The other eight cores are 3.2 GHz single instruction multiple data (SIMD)

processors referred to as Synergistic Processing Elements (SPEs).

Figure 1. A diagram of the Cell architecture in the PlayStation 3.

Each SPE is comprised of a Synergistic Processing Unit (SPU), a dedicated memory flow controller (MFC),

and a 256 KB local store (LS). The LS is high-speed SRAM used to store the SPE's program code and data.

Since an SPE cannot directly access main memory, all read and write operations issued by an SPE are

limited to its LS. Data transfer operations between an SPE's LS and main memory are handled by the MFC.

The MFC allows for data transfer operations to be carried out while the SPE performs other tasks. This

provides the opportunity for data transfer latencies to be hidden using software controlled multi-buffering.

Although the Cell BE processor includes eight SPEs, only six are available for use by applications running

on the PS3. One SPE is disabled to make manufacturing the Cell BE more cost effective. Another SPE is

dedicated to performing support tasks for the virtualization layer (or hypervisor) known as Game OS. Each

SPE delivers a peak of 25.6 Gflop/s single precision performance at 3.2 GHz, so the six SPEs are capable of

a max of 6 x 25.6 Gflop/s = 153.6 Gflops/s.

The PPE consists of the simultaneous multi-threading Power Processing Unit (PPU), a 32 KB

instruction/data L1 cache, and a 512 KB L2 cache. The PPE uses the PowerPC 970 instruction set, so it is

capable of running existing applications compiled for the PowerPC architecture. The PPE can also directly

Reserved for OS

Disabled Backup SPE

address hardware resources and has full access to main memory. Thus, the PPE is ideal for running an

operating system. Although the PPE is a capable processor for computational tasks, its main responsibility is

the distribution and management of tasks for the SPEs.

The Cell BE processor is connected to main memory via the memory interface controller (MIC). In the PS3,

the MIC provides a 25.6 GB/s connection to 256 MB of dual-channel Rambus Extreme Data Rate (XDR)

memory. Approximately 200 MB of this memory is available to applications running on the PS3.

The PPE, SPEs, MIC, and input/output device system are connected via the element interconnect bus

(EIB). The EIB is a four ring data communication structure which supports a total bandwidth of 204.8 GB/s.

Each of the components have a 25.6 GB/s connection to the EIB.

Besides the Cell BE processor and the previously mentioned 256 MB main memory, the PS3 includes a 256

MB nVidia RSX graphics processing unit (GPU), a 60 GB hard drive, a GigaBit Ethernet (GigE) NIC, and a

Blue-ray disc drive. The GPU could be used for computation in concert with the Cell BE processor, but at

present its use is blocked by the hypervisor.

3. Ray tracing

Ray tracing is a visualization technique that uses an eye, a port, an object, and a light source. Part of the

reason that we chose to build a ray tracing application is that it fits well with the capabilities of the PS3:

ray tracing does not require a high level of precision, but it does require many computations in a short amount

of time. For each pixel in the port, ray tracing determines whether the light source's rays hit the object. This

means that each hit-or-miss calculation is independent, so each SPE can work on a line of calculations until

the reflections and lighting have been calculated for the entire screen. The resulting image is then drawn

to the screen.

Figure 2. An illustration of ray tracing

Building a well-developed ray tracer is a very involved project, and we intentionally chose not to do this.

This project focused on how the hardware could accelerate the ray tracing process, so we chose to optimize

the hardware implementation instead of the algorithm. By optimizing only the hardware implementation, we can

more closely monitor how much gain we are getting from the Cell hardware, without having to worry about

whether speed gains come from the algorithm or the hardware.

We started the project using C code from a simple ray tracing application developed by Eric Rollins. His

application only considered intersections of circles and rays of light. To expand on this, we first looked

at intersections of triangles and rays. Once we were able to render triangles, we went on to adapt

height field data from tsunami simulations so that it could be visualized. The height field data was large

enough that we had to expand the data handling, since only a limited amount of data can be stored in an

LS.

4. Tsunami Visualization on the PS3

4.1 PPU Program

The following offers a description of the PPU side program of the ray tracing application. First, float values

are read from a file containing triangulated data to populate an array of triangle structures in main memory.

Each triangle structure represents a single triangle and contains three float vectors. Each float vector

represents a point of the triangle and contains three float values for the x, y, and z coordinates of the point.

Since each float vector is 16 bytes, each triangle structure is 48 bytes. After the array of triangle structures is

filled, a control block structure is initialized for each SPE containing two 32 bit addresses and two integer

values. One address gives the location in main memory of a two-dimensional array of long values, referred to

from now on as the pixel buffer array. This array is used to store pixel color values given as a result of the

ray tracing algorithm. The other address gives the location in main memory to the previously described array

of triangle structures. One of the integers gives a value, referred to here as the line offset value, which

dictates how the workload is split between the SPEs. The second integer is a filler value used to simply

expand the size of the structure to 16 bytes so that it is the correct size for a DMA transfer. After the control

block structures are set up, a thread is created for each SPE using the spe_create_thread function from the

IBM SDK libspe library. The address of the SPE's associated control block structure is passed as an

argument to the spe_create_thread function. On creation of an SPE thread, the respective SPU side program

is started. A description of the workings of the general SPU side program is given later. Finally, after

receiving a mailbox message from the SPEs notifying that all computations were performed, the pixel buffer

array is rendered to an image using Simple DirectMedia Layer (SDL).

4.2 SPU Program

The following describes the working of the SPU side program of the ray tracing application. First, the control

block structure is transferred from main memory to the LS. After this, the LS is filled with 3696 triangle

structures using an efficient transfer mechanism. This is accomplished through 11 DMA operations where

each time 336 triangle structures are copied from main memory to the LS. A DMA operation in which 336

triangle structures are transferred is optimal: 336 triangle structures amount to 336

x 48 bytes = 16128 bytes, and no greater whole number of triangle structures can be transferred such that the size

of the transfer is under 16 KB (the maximum amount of data that can be transferred in one DMA operation) and a multiple of 128

bytes (necessary for an optimal transfer time). A total of 11 of these transfers fills the LS. After the LS is

filled with triangle structures, the ray tracing algorithm is executed on those triangle structures, and the pixel

buffer array is updated as necessary. The previous two steps, the transfer of triangle structures and the

execution of the ray tracing algorithm on the triangle structures, are repeated as needed until all of the triangle

structures have been processed. The line-offset value specified in the control block structure dictates which

lines in the final image the SPE is responsible for calculating. This ensures that each SPE is responsible for

doing 1/6 of the total ray tracing calculations. Lastly, the updated pixel buffer array is transferred to main

memory, and a mailbox message is written to the PPE signaling that computations are complete.

Figure 3. A block diagram of the ray tracing program run on the Cell architecture

5. Parallel Layout

The process of developing applications for the Cell BE Processor is more involved than programming for

other common architectures. The design of the Cell BE Processor necessitates an application design

consisting of separate programs for the general purpose PPE and the SIMD SPEs. The program running on

the PPE consists of control code for the application. For example, the PPE program initiates the SPE

threads and keeps the SPE programs synchronized. Synchronization and other communication between the

PPE and SPE programs is maintained through a mailbox messaging system. The SPEs are vector processors,

so special care must be taken to vectorize the programs that run on the SPEs. Libraries developed

specifically for the Cell Processor handle vector operations. The SPE programs must explicitly handle the

transfer of data between the SPEs and main memory. Each SPE has a 256 KB LS used to hold the program

and its data. The PPE program has direct access to main memory, but the SPE programs must use DMA

operations to transfer data back and forth from main memory. The limited memory capacity of the SPE's LS

requires the implementation of mechanisms to repeatedly fetch data from main memory during computation.

The DMA operations are handled independently from the SPE core by the MFC. The MFCs allow for data

transfer times to be hidden in situations where computations associated with a chunk of data are more time

consuming than the time taken to transfer the data. Hiding data transfer latencies is accomplished through

explicitly implemented buffering techniques. Such techniques range from simple double buffering to

software-hyperthreading. For the previously mentioned reasons, the implementation of applications for the

Cell Processor clearly requires attention to details beyond those needed in the development of software

systems for other common architectures.

The software development kit provided by IBM, available at http://www.ibm.com/developerworks/power/cell/,

facilitated programming for the PlayStation and proved to be quite helpful. The IBM SDK included a

compiler designed specifically for programs written for the Cell, as well as an array of functions that

simplified programming. One of the data types available in Cell programming is the vector float, which is an

array of floats of size four. The SDK allowed us to perform calculations on the entire vector float, instead of

forcing us to loop through the array and compute each element individually. Other functionality that the SDK provides

includes SPE timings and data transfers. However, the SDK did not simplify PPE/SPE communications, and

any data passed between the two needed to be extracted using direct memory access.

Figure 4. A render of the triangulated tsunami data

Figure 5. Above: A 2D depiction of height field data rendered in MatLab. Below: The same height

field data rendered as a 3D image in the PS3 ray caster.

6. Heat Equation on the PS3

As an experiment in parallel programming we tried to choose a simple 2-D problem that we would be able to

try and run in parallel using the SPUs on the PS3's Cell processor. The heat equation $c\nabla^2 u = \frac{\partial u}{\partial t}$, where $u$

is the temperature field and $c$ is the thermal diffusivity, turns out to be pretty simple when you look at the

steady state case where $\frac{\partial u}{\partial t} = 0$. When we solve this equation to figure out how to find $u_{i,j}$ we find that

we are averaging the surrounding nodes to find the value of the center node. In the figure below we solve the

steady state heat equation by using the recursive equation $x = \frac{a + b + c + d}{4}$ at each iteration.

Figure 6. Central differencing cross: Essentially we are taking the average of these points to compute

the next iteration.

The tricky part for parallel computation is the domain decomposition. We get around this by splitting the

region into layers, one per SPU, and sharing the boundaries between the layers. The boundary needs to contain

two nodes to allow the layers to communicate with each other. One of the difficulties we ran into was

how to communicate efficiently between the nodes. Poor

communication results in impaired performance, so we spent a lot of time looking at efficient ways to

communicate between SPU nodes without using the PPU. We found it was best for

the SPUs to directly access values from memory, with the PPU simply used to coordinate when

data was ready to transfer instead of actually performing the data transfer.

Figure 7. Domain decomposition between SPUs

7. Set Up of the Driven Cavity Problem

7.1 Overview

Problems in computational fluid dynamics (CFD) may prove to be a better application for the PS3 Cell BE

processor than ray tracing. The nature of such problems, generally, is grid generation coupled with ODE and

PDE solutions through iteration and convergence methods. Thus the problem can be split up and each

processor assigned to a section of the grid. Because CFD focuses more on computation rather than

generating high quality visualization, these problems may be more suited for our experimentation on the

PS3. The driven cavity problem was chosen to test CFD on the PS3. With the particular parameters chosen,

computation on a standard desktop CPU can give a convergence time on the order of 30 minutes, which

leaves room for improvement on a parallel processor.

The driven cavity problem is well known within the field of computational fluid dynamics and provides a

good premise for testing methods as well as studying the effects of input parameters. The problem consists

of a rectangular cavity with an upper plate, which moves to the right at a velocity of u₀. This moving plate

drags the fluid to the right and sets up a convection pattern in the cavity. The purpose of this section is to

explain this problem and how we would implement it on the PS3.

7.2 Theory

The governing theory behind the driven cavity problem is the vorticity-stream function formulation in two-

dimensional flow. For two-dimensional flow, a function called the stream function, which satisfies the

continuity equation, is defined by:

$$u = \frac{\partial \psi}{\partial y} \qquad (1)$$

$$v = -\frac{\partial \psi}{\partial x} \qquad (2)$$

The lines of constant $\psi$ represent streamlines; the difference between streamline values gives the volumetric flow.

To find vorticity, the vorticity transport equation can be derived from the primitive formulation of the

incompressible Navier-Stokes equations and is shown in Equation 3.

$$\frac{\partial \Omega}{\partial t} + u\frac{\partial \Omega}{\partial x} + v\frac{\partial \Omega}{\partial y} = \nu\left(\frac{\partial^2 \Omega}{\partial x^2} + \frac{\partial^2 \Omega}{\partial y^2}\right) \qquad (3)$$

Finally, streamline values are calculated from the stream function equation, which is derived from the

definition of vorticity along with Equations 1 and 2. That is:

$$\frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2} = -\Omega \qquad (4)$$

From these four equations, the behavior of the fluid in the cavity can be calculated by applying the correct

numerical algorithm.

7.4 Method

The cavity in this problem has dimensions of L by H and the fluid inside the cavity has a kinematic viscosity

of 0.0025 m2/s. The setup for the cavity is shown in Figure 8.

Figure 8. The domain of solution for the driven cavity problem

The numerical algorithm for solving the governing equations is to use the FTCS explicit scheme and the

point Gauss-Seidel formulation to solve for the vorticity and the stream function equations, respectively.

The iterative nature of the Gauss-Seidel formulation and the steady state requirements on the vorticity

equation require the use of convergence values for both formulations. A value of 0.001 is set for the

convergence of the stream function and a value of 0.002 is set for the convergence of the vorticity equation.

Table 1 provides a summary of the specific input parameters.

Dimensions and properties                Grid size

L = 40 cm                                IM = 41,  JM = 31    (Δx, Δy = 0.01 m)
H = 30 cm, 10 cm                         IM = 81,  JM = 61    (Δx, Δy = 0.005 m)
u₀ = 5 m/s                               IM = 161, JM = 121   (Δx, Δy = 0.0025 m)
Δt = 0.001 s                             IM = 321, JM = 241   (Δx, Δy = 0.00125 m)
ν = 0.0025, 0.00025, 0.000025 m²/s

Table 1. An overview of parameters to be studied; note that some solutions may not be stable enough

to converge.

The FTCS explicit scheme applies a finite difference formulation to Equation 3 to obtain

$$\Omega_{i,j}^{n+1} = \Omega_{i,j}^{n} + \Delta t\left[\nu\left(\frac{\Omega_{i+1,j}^{n} - 2\Omega_{i,j}^{n} + \Omega_{i-1,j}^{n}}{(\Delta x)^2} + \frac{\Omega_{i,j+1}^{n} - 2\Omega_{i,j}^{n} + \Omega_{i,j-1}^{n}}{(\Delta y)^2}\right) - u_{i,j}^{n}\frac{\Omega_{i+1,j}^{n} - \Omega_{i-1,j}^{n}}{2\Delta x} - v_{i,j}^{n}\frac{\Omega_{i,j+1}^{n} - \Omega_{i,j-1}^{n}}{2\Delta y}\right] \qquad (5)$$

for which the vorticity at the next time step can be found. The numerical scheme used to solve the stream

function was the Gauss-Seidel formulation, which yields the equation

$$\psi_{i,j}^{k+1} = \frac{1}{2(1+\beta^2)}\left[(\Delta x)^2\Omega_{i,j}^{n+1} + \psi_{i+1,j}^{k} + \psi_{i-1,j}^{k+1} + \beta^2\left(\psi_{i,j+1}^{k} + \psi_{i,j-1}^{k+1}\right)\right], \quad \text{where } \beta = \frac{\Delta x}{\Delta y} \qquad (6)$$

These equations are solved by first computing the vorticity within the domain, which is then substituted into

Equation 6 to solve for the stream function. This procedure is repeated until convergence is met. One

detail to note is that the boundary conditions for the vorticity must be updated as Ω is updated.

8. Future Prospects

The unique architecture of the PS3 offers a few optimization methods that could be pursued in the future.

The first method considered utilizes the built-in libraries available for the Cell processor that are capable of

SIMD mathematical operations. Once implemented, SIMD instructions can operate on 4 sets of data at once.

This is done by vector operations where each entry in a vector of length four can be added, subtracted,

multiplied or divided by the corresponding entries in another vector of length four.

The second method for code optimization is buffering, or pipelining. This technique prompts each SPE

to load an initial set of data, then begin computation on that data while the next set is loaded in. Such an

algorithm would eliminate much of the time wasted by an inactive SPU, as illustrated in Figure 9.

Figure 9. A depiction of double buffering

A third method for optimization is implementing MPI (Message Passing Interface) across more than one

PS3. MPI would enable one PPE to send instructions to more than one cell processor. This would increase

the number of SPEs available for use, thus decreasing computation time.

Along with MPI, we are considering rendering on a platform with a graphics card after completing computation on

the Cell. This would resolve many of the visualization issues.

Another prospect for future work is to switch concentration from the ray tracing algorithm to CFD

applications. This is practical because graphics cards can handle most of the issues arising from the ray

tracing algorithm and an algorithm that is capable of running a CFD problem would be of higher value.

Final Remarks

Programming on the PS3 has posed a very interesting and challenging problem to our group, and many

ideas and much speculation have emerged from our study. Overall, the PS3 is a powerful machine, but much care

must be taken when programming for the Cell BE processor. The architecture of the Cell only allows an

SPU to work on small amounts of data at a time, which leads to a great amount of data being moved through

memory; this eventually became a limiting factor in the algorithms used. This limiting factor requires

many levels of optimization in order to bring computation time down to a minimum. Such unique

characteristics bring about a tradeoff between time invested in programming and money invested in an

architecture that is easier to optimize.

Acknowledgements

We acknowledge discussions with Prof. Paul Woodward and help from Spring Liu.

This research was supported by the Middleware grant of IF program of National Science Foundation.

References:

Bellens, P., Perez, J.M., Badia, R.M., Labarta, J., CellSs: a Programming Model for the Cell BE

Architecture, Proceedings of the ACM/IEEE SC 2006 Conference, 2006.

Benthin, C., Wald, I., Scherbaum, M., Friedrich, H., Ray Tracing on the Cell Processor, IEEE Symposium on

Interactive Ray Tracing, 15-23, 2006.

Buttari, A., Dongarra, J. and Kurzak, J., Limitations of the PlayStation 3 for High Performance Cluster

Computing, Tech. rep., Department of Computer Science, University of Tennessee, UT-CS-07-597, 2007.

Buttari, A., Luszczek, P., Kurzak, J., Dongarra, J., Bosilca, G., SCOP3: a rough guide to scientific computing

on the PlayStation 3, Tech. rep., Innovative Computing Laboratory, University of Tennessee Knoxville, UT-

CS-07-595, 2007.

IBM, Cell Broadband Engine Architecture, Version 1.01, 2006.

IBM, Synergistic Processor Unit Instruction Set Architecture, Version 1.2, 2007.

Kurzak, J., Buttari, A., Luszczek, P., Dongarra, J., The PlayStation 3 for High-Performance Scientific

Computing, Computing in Science & Engineering, 84-87, 2008.

Dally, W.J., Hanrahan, P., Erez, M., Knight, T.J., Labonté, F., Ahn, J.-H., Jayasena, N., Kapasi, U.J.,

Das, A., Gummaraju, J., Buck, I., Merrimac: Supercomputing with Streams, SC2003, Phoenix, Arizona,

November 2003.

Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands, P. and Yelick, K., The Potential of the Cell Processor

for Scientific Computing, CF '06: Proceedings of the 3rd Conference on Computing Frontiers, 9-20, 2006.

**For insight on early work done on modern stream computing and GPU processing look up the

Merrimac project and Pat Hanrahan.