SKR3202 - Coggle Diagram
Chapter 4
communication
Factors to consider
Cost of communications
communications frequently require synchronization between tasks, which can result in tasks spending time "waiting" instead of doing work
synchronization
barrier
when the last task reaches the barrier, all tasks are synchronized
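The barrier idea above can be sketched with Python's standard `threading.Barrier` (a minimal illustration, not tied to any particular parallel library):

```python
import threading

NUM_TASKS = 4
barrier = threading.Barrier(NUM_TASKS)
results = []
lock = threading.Lock()

def task(task_id: int) -> None:
    # Each task does its own work at its own pace...
    partial = task_id * task_id
    # ...then blocks here until the last task arrives.
    barrier.wait()
    # Past this point, all tasks are synchronized.
    with lock:
        results.append(partial)

threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1, 4, 9]
```

Tasks that reach the barrier early spend time "waiting" instead of doing work, which is exactly the communication cost noted above.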
synchronous
when a task performs a communication operation, some form of coordination is required with the other task(s)
I/O
bad news
read operations will be affected by the fileserver's ability to handle multiple read requests at the same time
performance
debugging, monitoring and analyzing parallel program execution is significantly more of a challenge than for serial programs
Chapter 2
Processing Elements
SIMD
Data parallel, vector computing
MISD
Few built, but commercially not available
MIMD
very general, multiple approaches
unlike SISD and MISD, MIMD computers work asynchronously
Shared memory
easy to build; SISD programs can easily be ported
Distributed Memory
the network can be configured as a tree, mesh, or cube
Operating system models
a framework that unifies features, services, and tasks performed
simplicity, flexibility and high performance are crucial for an OS
Parallel Terminologies
synchronization - the coordination of parallel tasks in real time, very often associated with communications
Chapter 1
Why parallel computing
Limitations of serial computing
both physical and practical reasons pose significant constraints on simply building ever-faster serial computers
Concepts and terminology
Von Neumann Architecture
Basic design
a CPU gets instructions from memory, decodes them, and performs them sequentially
Chapter 7
Performance metrics
Speedup
ratio of the time taken to solve a problem on a single processor to the time taken to solve the same problem on a parallel computer with p identical processors
Efficiency
in an ideal parallel system, speedup is equal to p and efficiency is equal to 1
in practice, speedup is less than p and efficiency is between zero and one, depending on how effectively the processors are utilized
Cost
the cost of solving a problem on a single processor is the execution time of the fastest known sequential algorithm
a parallel system is said to be cost-optimal if the cost of solving a problem on the parallel computer is proportional to the execution time of the fastest known sequential algorithm on a single processor
since efficiency is the ratio of Ts to the parallel cost p·Tp, a cost-optimal parallel system has an efficiency of Θ(1)
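The relationships among these metrics can be sketched numerically (a minimal illustration; the symbols Ts, Tp and p follow the definitions above):

```python
def speedup(ts: float, tp: float) -> float:
    """Speedup S = Ts / Tp: serial time over parallel time."""
    return ts / tp

def efficiency(ts: float, tp: float, p: int) -> float:
    """Efficiency E = S / p: fraction of ideal speedup achieved."""
    return speedup(ts, tp) / p

def cost(tp: float, p: int) -> float:
    """Cost = p * Tp: total processor-time spent by the parallel system."""
    return p * tp

# Example: a problem taking 100 s serially and 30 s on 4 processors.
ts, tp, p = 100.0, 30.0, 4
print(speedup(ts, tp))        # ~3.33 (less than p, as in practice)
print(efficiency(ts, tp, p))  # ~0.83 (between zero and one)
print(cost(tp, p))            # 120.0 (greater than Ts, so not cost-optimal)
```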
Scalability
scalability of a parallel system is a measure of its capacity to increase speedup in proportion to the number of processors
a scalable parallel system can always be made cost-optimal if the number of processors and the size of the computation are chosen appropriately
ability to maintain efficiency at a fixed value by simultaneously increasing the number of processors and the size of the problem
Chapter 3
Memory architecture
shared memory
disadvantages
the programmer is responsible for synchronization constructs that ensure correct access to global memory
distributed memory
when a processor needs data from another processor, it is usually the task of the programmer to define when and how data is communicated
advantages
as the number of processors increases, the size of memory increases proportionately
Programming models
shared memory model
tasks share a common address space, which they read and write asynchronously
advantages - no need to specify explicitly the communication of data between tasks; program development can often be simplified
implementations
on shared memory platforms - native compilers translate user program variables into actual memory addresses, which are global
no common distributed memory platform implementation exists; the KSR ALLCACHE approach provided a shared memory view of data even though the physical memory of the machine was distributed
threads model
a single process can have multiple, concurrent execution paths
implementations
in both cases, the programmer is responsible for determining all parallelism
these implementations differed substantially from each other, making it difficult for programmers to develop portable threaded applications
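A minimal sketch of the threads model using Python's standard `threading` module (an analogy only; the notes above are about implementations such as POSIX Threads, and the shared variable and lock here are illustrative names):

```python
import threading

shared_total = 0            # lives in the single process's common address space
lock = threading.Lock()     # programmer-supplied synchronization construct

def worker(values):
    """One concurrent execution path inside the single process."""
    global shared_total
    local_sum = sum(values)          # independent work
    with lock:                       # ensure correct access to shared state
        shared_total += local_sum    # every thread sees the same variable

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in data]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_total)  # 45
```

Note that determining the parallelism (how data is split, where the lock is needed) is entirely the programmer's job, as the note above says.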
message passing model
MPI was formed with the primary goal of establishing a standard interface for message passing implementations
for shared memory architectures, MPI implementations usually don't use a network for task communications; instead they use shared memory (memory copies) for performance reasons
data parallel model
a set of tasks works collectively on the same data structure; however, each task works on a different partition of that data structure
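The data parallel idea can be sketched as follows (a toy illustration in plain Python; real data parallel systems perform the partitioning and any communication for you):

```python
def partition(data, num_tasks):
    """Split one data structure into num_tasks contiguous partitions."""
    chunk = (len(data) + num_tasks - 1) // num_tasks
    return [data[i * chunk:(i + 1) * chunk] for i in range(num_tasks)]

def task(my_partition):
    """The same operation, run by every task on its own partition."""
    return [x * 2 for x in my_partition]

data = list(range(8))                 # the shared data structure
parts = partition(data, num_tasks=4)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
result = [y for part in parts for y in task(part)]
print(result)  # [0, 2, 4, 6, 8, 10, 12, 14]
```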
hybrid
this hybrid model lends itself well to the increasingly common hardware environment of networked SMP machines
data parallel implementations on distributed memory architectures actually use message passing to transmit data between tasks, transparently to the programmer
SPMD
at any moment in time, tasks can be executing the same or different instructions
SPMD programs have the necessary logic programmed into them to allow different tasks to branch or conditionally execute only those parts of the program they are meant to execute
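The SPMD branching logic can be sketched like this (plain Python; the rank-style `task_id` parameter is an assumed convention, echoing how MPI assigns each task a rank):

```python
def spmd_program(task_id: int, num_tasks: int) -> str:
    """Every task runs this same program; branches select per-task work."""
    if task_id == 0:
        # only task 0 executes this part of the program
        return "task 0: coordinating"
    # all other tasks conditionally execute this part instead
    return f"task {task_id}: computing"

# The one program, launched as 4 tasks:
outputs = [spmd_program(i, 4) for i in range(4)]
print(outputs[0])  # task 0: coordinating
print(outputs[3])  # task 3: computing
```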
MPMD
like SPMD, MPMD can be built with any combination of parallel programming models
while the application is being run in parallel, each task can be executing the same or a different program than the other tasks
Chapter 6
MPI
Background
library standard defined by a committee of vendors, implementers and parallel programmers
100% portable: one standard, many implementations
basic communication
Asynchronous - one call indicates the start of a send or receive, and another call is made to determine if it has finished
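In MPI this start-then-check pattern is the nonblocking `MPI_Isend`/`MPI_Irecv` plus `MPI_Test`/`MPI_Wait` pair. The same two-call shape can be sketched with Python's standard `concurrent.futures` (an analogy only, not MPI itself; `send_message` is a made-up stand-in):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def send_message(payload):
    """Stand-in for a communication operation that takes time to complete."""
    time.sleep(0.1)
    return f"delivered: {payload}"

with ThreadPoolExecutor(max_workers=1) as pool:
    # First call only *starts* the operation (like MPI_Isend)...
    handle = pool.submit(send_message, "hello")
    # ...so the task is free to do other useful work meanwhile.
    other_work = sum(range(1000))
    # A second call waits for / checks completion (like MPI_Wait / MPI_Test).
    print(handle.result())  # delivered: hello
```

Overlapping communication with computation like this is the main reason to prefer the asynchronous calls.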
Chapter 9
Big data
large datasets, complex and changeable
cannot be stored, managed, and processed adequately by traditional data management software and tools
Solutions - relying on technology that can handle, process and analyze large quantities of data
challenges
diversity of data
structured, semi-structured, multi-structured and unstructured
Stored data processing
batch-based processing
batch results are produced after the data is collected, entered and processed
separate techniques or programs for input, processing and output
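The batch pattern above (collect the whole dataset first, then process, then output, with the three stages kept separate) can be sketched as follows (a toy illustration; the stage names are mine):

```python
def collect(raw_records):
    """Input stage: gather and clean the whole batch before processing starts."""
    return [r.strip() for r in raw_records if r.strip()]

def process(batch):
    """Processing stage: runs only after the batch is complete."""
    return {record: len(record) for record in batch}

def output(results):
    """Output stage: results appear only after processing has finished."""
    return sorted(results.items())

raw = ["alpha\n", "  ", "beta\n", "gamma\n"]
report = output(process(collect(raw)))
print(report)  # [('alpha', 5), ('beta', 4), ('gamma', 5)]
```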