Parallel Computing in bioinformatics

Parallel computing can help deliver answers in real time, e.g. for point-of-care testing

Types of parallelism

Cluster

Symmetric Multiprocessing (SMP)

Single Instruction, Multiple Data (SIMD)

one instruction operating simultaneously on different data - vectorizing code - parallelism within a core
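A minimal sketch of vectorizing, assuming NumPy is installed. The helper name gc_fraction and the example sequence are made up for illustration. The comparison is applied to the whole array in one call instead of a Python loop over bases; whether the underlying loop really runs as SIMD instructions depends on the NumPy build and the CPU.

```python
import numpy as np

def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases, computed on the whole array at once."""
    # View the sequence as an array of byte codes (one element per base).
    bases = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    # One vectorized comparison per letter instead of a Python-level loop.
    is_gc = (bases == ord("G")) | (bases == ord("C"))
    return float(is_gc.mean())

print(gc_fraction("ACGTGGCC"))
```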

Using multiple cores on same computer (threads)

Using multiple cores (across computers)

ad hoc - bunch of PCs over ethernet

Cluster specific - high density and fast interconnect. E.g. spartan

Highly specialized - high density, low power, low latency and very fast interconnect. E.g. IBM BlueGene

Break tasks into subtasks - either dependent or independent (embarrassingly parallel)
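A small sketch of the independent case, using only the Python standard library. The function names count_kmers and count_all and the reads are hypothetical. Each read is one subtask and no subtask needs the result of another, so they can be handed to a pool of workers in any order.

```python
from concurrent.futures import ThreadPoolExecutor

def count_kmers(read: str, k: int = 3) -> int:
    """Independent subtask: number of distinct k-mers in one read."""
    return len({read[i:i + k] for i in range(len(read) - k + 1)})

def count_all(reads, workers: int = 4):
    """Run one subtask per read; no subtask depends on another."""
    # Threads keep the sketch short. For CPU-bound pure-Python work,
    # ProcessPoolExecutor (same interface) would be needed to use several cores.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in input order.
        return list(pool.map(count_kmers, reads))

reads = ["ACGTACGT", "AAAAAA", "ACGT"]
print(count_all(reads))
```

Dependent subtasks would not fit this pattern directly, because a worker would have to wait for another worker's output.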

Tools that use multithreading - BWA, bowtie2, samtools

To use SMP - standard platforms = POSIX threads, OpenMP, Unix Shell

Sometimes threads or processes need to talk to each other = Inter-Process Communication (IPC), e.g. via:

time stamped files

pipes, sockets and message passing

shared memory

signals
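As an illustration of the socket / message passing option, here is a minimal standard-library sketch. The names worker and ask_worker are made up, and to keep it runnable both ends are threads in one process; the same send and receive calls are what separate processes or machines would use.

```python
import socket
import threading

def worker(conn: socket.socket) -> None:
    """Receive one message and reply with its reverse complement."""
    with conn:
        seq = conn.recv(1024).decode("ascii")
        comp = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
        conn.sendall(comp.encode("ascii"))

def ask_worker(seq: str) -> str:
    # socketpair() returns two sockets that are already connected to each other.
    parent, child = socket.socketpair()
    t = threading.Thread(target=worker, args=(child,))
    t.start()
    with parent:
        parent.sendall(seq.encode("ascii"))          # send the request
        reply = parent.recv(1024).decode("ascii")    # block until the reply arrives
    t.join()
    return reply

print(ask_worker("AACG"))
```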

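GPUs are mostly SIMD-style (SIMT) - the same operation applied to many elements of a vector/array at once. MIMD = different operations on different data at the same time (e.g. separate cores or cluster nodes)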

Libraries - Numpy, GSL, BLAS

Tools - HMMER (sequence alignment), FASTA 35+, SWIFT (full local, global, semi alignment), BWA, bowtie (short read alignment)

For usage - see lecture 21, slides 22-24, 27, 32, 36

Dedicated pipeline systems - Make, Snakemake, Nextflow, Cromwell

In Unix, piping (|) gives implicit parallelism - all processes in a pipeline are started at the same time and run concurrently, so the OS can schedule them on different cores
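The same idea can be sketched from Python with the standard subprocess module. The two stages here are tiny made-up Python one-liners (PRODUCER and CONSUMER are hypothetical names) standing in for real tools; the point is that the second process is started while the first is still running and reads its output as it arrives.

```python
import subprocess
import sys

# Stage 1 prints three reads; stage 2 upper-cases whatever arrives on stdin.
PRODUCER = "print('acgt'); print('ttga'); print('ggcc')"
CONSUMER = "import sys\nfor line in sys.stdin: print(line.strip().upper())"

def run_pipeline() -> str:
    # Both processes are started before either has finished,
    # like `producer | consumer` in a shell.
    p1 = subprocess.Popen([sys.executable, "-c", PRODUCER],
                          stdout=subprocess.PIPE)
    p2 = subprocess.Popen([sys.executable, "-c", CONSUMER],
                          stdin=p1.stdout, stdout=subprocess.PIPE, text=True)
    # The parent's copy of the pipe is not needed; closing it lets the
    # producer get SIGPIPE if the consumer exits early.
    p1.stdout.close()
    out, _ = p2.communicate()
    p1.wait()
    return out

print(run_pipeline())
```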

Sub-shells / process substitution (<(...)) - instead of uncompressing a file to disk first and then feeding in the whole thing, it can be uncompressed and streamed on the fly - thus saving disk space
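A rough Python analogue of streaming a compressed file, using only the standard library. The file name demo.fastq.gz, its contents and the function count_fastq_reads are made up for the demonstration, and the count assumes plain 4-line FASTQ records. The file is decompressed piece by piece as it is read, so no uncompressed copy is written to disk.

```python
import gzip
import os
import tempfile

def count_fastq_reads(path: str) -> int:
    """Count reads in a gzipped FASTQ without writing an uncompressed copy."""
    n_lines = 0
    # gzip.open decompresses incrementally as lines are requested.
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    return n_lines // 4  # assumes 4 lines per FASTQ record

# Tiny made-up file for demonstration.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.fastq.gz")
    with gzip.open(demo, "wt") as out:
        out.write("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
    n_reads = count_fastq_reads(demo)

print(n_reads)
```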

GNU parallel, Makefile (make -j runs independent targets in parallel)