PLASMA, that enables the writing of portable Single-Instruction, Multiple-Data (SIMD) programs. PLASMA uses an Intermediate Representation (IR) that provides succinct, clean abstractions (i.e., free from the details of any particular SIMD architecture), so that programs can be compiled for different PUs. A runtime then automatically multithreads these programs and executes them on different PUs, such as a GPU, a CPU, and a Cell BE; for example, a for loop in a kernel can be split across a CPU and a GPU. The runtime handles load balancing and distributed memory, and ensures that, before any computation, the required data are moved from the CPU to the GPU or vice versa.
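
As a rough illustration of the loop-splitting idea, the following minimal Python sketch partitions the iteration space of a simple kernel into contiguous chunks and runs each chunk on its own worker. It is not PLASMA's IR or runtime API: the names `split_range`, `saxpy_chunk`, and `run_split` are hypothetical, the two "PUs" are simulated with ordinary threads, and the `weights` argument merely stands in for whatever load-balancing estimate a real runtime would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(n, weights):
    """Split range(n) into contiguous (start, stop) chunks.

    Chunk sizes are proportional to `weights` (hypothetical relative
    throughputs of each processing unit); the chunks cover 0..n exactly.
    """
    total = sum(weights)
    bounds = [0]
    acc = 0
    for w in weights[:-1]:
        acc += w
        bounds.append(int(n * acc / total))
    bounds.append(n)
    return [(bounds[i], bounds[i + 1]) for i in range(len(weights))]


def saxpy_chunk(a, x, y, out, start, stop):
    """Loop body for one chunk: out[i] = a * x[i] + y[i].

    In a real heterogeneous runtime, the slices x[start:stop] and
    y[start:stop] would first be copied to the target device's memory.
    """
    for i in range(start, stop):
        out[i] = a * x[i] + y[i]


def run_split(a, x, y, weights):
    """Run the loop with one worker thread per simulated PU."""
    n = len(x)
    out = [0.0] * n
    chunks = split_range(n, weights)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        futures = [
            pool.submit(saxpy_chunk, a, x, y, out, start, stop)
            for start, stop in chunks
        ]
        for future in futures:
            future.result()  # re-raise any worker exception
    return out, chunks
```

Calling `run_split(2.0, x, y, [1, 3])` would give the second simulated PU roughly three times as many iterations as the first. Because the chunks are disjoint, the workers never write to the same output element, which is what makes the split safe without locking.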