Multiple Sequence Alignment - MSA
Multiple Sequence Alignment
Alignment of more than 2 sequences at a time
Uses
Identify homology - relationships between organisms
Identify regions of high sequence conservation
Identify protein domains/families - functional homology
Precursor for phylogeny/tree inference
Phylogenies
Tool: IQ-Tree
Scoring Methods
- a score (or cost) that measures how good a multiple alignment is
Sum of Pairs (SP)
Sum up the match and mismatch scores over all possible pairs of sequences at each base position
Assumes that all pairs are equally likely to be evolutionarily related
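A minimal sketch of SP scoring for an alignment whose rows are already the same length. The score values (match +1, mismatch -1, gap -2) and the rule that a gap paired with a gap scores 0 are assumptions for illustration, not values from the lecture.

```python
from itertools import combinations


def sp_column_score(column, match=1, mismatch=-1, gap=-2):
    """Sum the pairwise scores over every pair of symbols in one column."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue  # gap-gap pairs contribute nothing (assumed convention)
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def sp_score(alignment, **scores):
    """SP score of a whole alignment: the sum of its column scores."""
    return sum(sp_column_score(col, **scores) for col in zip(*alignment))


alignment = ["ACG-T", "ACGAT", "TCG-T"]
print(sp_score(alignment))
```

Every pair of rows is counted with the same weight, which is the "all pairs equally related" assumption in code form.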
Consensus/Star
Distance of all the sequences from one chosen sequence
Tree based (progressive)
Distance from one sequence to another along a tree, used as a measure of evolutionary distance
Lecture 15, slide 9
Entropy based
Aim for minimum entropy in each column (base position) - a completely conserved position in an alignment gives a score of 0
The lower the score, the better the alignment
Can discriminate between common and rare mismatches
Hence can differentiate between alignments with the same number of mismatches, since it considers the distribution of mismatches across columns (for the total score)
More biologically relevant, since conservation differs across columns
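A minimal sketch of entropy scoring, assuming Shannon entropy in bits per column and treating a gap as just another symbol (both are assumptions; the notes do not fix either choice). The alignment score is the sum of the column entropies.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the symbol frequencies in one column."""
    n = len(column)
    counts = Counter(column)
    # p * log2(1/p) summed over the symbols present in the column
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def alignment_entropy(alignment):
    """Total entropy score of an alignment: lower means more conserved."""
    return sum(column_entropy(col) for col in zip(*alignment))


print(column_entropy("AAAA"))  # fully conserved column
print(column_entropy("AAAC"))  # one rare mismatch
print(column_entropy("AACC"))  # evenly split column
```

The column with a single rare mismatch scores lower than the evenly split one, which is how this measure separates alignments with differently distributed mismatches.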
Multiple Alignment Methods
Pairwise Multiple Sequence Alignment (MSA)
Find likely areas of alignment using pairwise alignment by DPA (only in those areas, not the whole sequences)
Use SP scoring
Computationally intensive - O(n^k) for k sequences of length n
Star Alignment
Compute pairwise alignment score for all pairs
Pick a centre sequence (as ancestor) and then align all other sequences with this
Merge all the resulting consensus sequences into one (e.g. gaps add up)
Lecture 15, slide 12
Option 1: With each sequence as centre, sum over the similarities between the centre and all other sequences. Choose the centre that maximises this sum.
Option 2: Do star alignment with each sequence as centre and pick the best centre
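A minimal sketch of Option 1, assuming the similarity between two sequences is their global alignment score from dynamic programming with illustrative scores (match +1, mismatch -1, gap -2). The function names `nw_score` and `pick_centre` are hypothetical.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment score by dynamic programming."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pick_centre(seqs):
    """Index of the sequence whose summed similarity to all others is highest."""
    totals = [
        sum(nw_score(s, t) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]
    return max(range(len(seqs)), key=totals.__getitem__)


seqs = ["AGGT", "ACGT", "ACCT"]
print(seqs[pick_centre(seqs)])
```

Ties go to the earliest sequence here. Option 2 would instead build the full star alignment for each candidate centre and compare the resulting alignment scores.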
Progressive Alignment
Build a guide tree of relationships based on the distances between all pairs
Align the most closely related pairs from the tree first and get their consensus
Use DPA for the pairwise alignments
Align the next most related sequence to that consensus, and so on
Lecture 15, slides 15-17
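A minimal sketch of the order in which a progressive method would align things, given a pairwise distance matrix. It uses greedy bottom-up clustering with average linkage, which is an assumption (the notes only say the tree is based on pairwise distances), and it returns the merge order only, not the alignments themselves.

```python
def guide_tree_order(names, dist):
    """Greedy bottom-up clustering; returns the groups merged at each step."""
    clusters = {i: (name,) for i, name in enumerate(names)}
    # d maps an unordered pair of cluster ids to the distance between them
    d = {
        frozenset((i, j)): dist[i][j]
        for i in clusters for j in clusters if i < j
    }
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)  # closest pair of clusters
        i, j = sorted(pair)
        merges.append((clusters[i], clusters[j]))
        size_i, size_j = len(clusters[i]), len(clusters[j])
        merged = clusters.pop(i) + clusters.pop(j)
        del d[pair]
        # distance from the merged cluster to each remaining cluster
        for k in clusters:
            dik = d.pop(frozenset((i, k)))
            djk = d.pop(frozenset((j, k)))
            d[frozenset((next_id, k))] = (
                (size_i * dik + size_j * djk) / (size_i + size_j)
            )
        clusters[next_id] = merged
        next_id += 1
    return merges


names = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 4],
    [10, 10, 4, 0],
]
for left, right in guide_tree_order(names, dist):
    print(left, "+", right)
```

Each merge is final, so a poor early choice is never revisited.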
Challenges
Computing pairwise alignments for all pairs using DPA is computationally intensive
Greedy bottom-up clustering - gaps and errors introduced during the early alignments are propagated unchanged throughout
Tools & Variations
CLUSTALW - weighted progressive alignment
The most closely related sequences are given small weights and are aligned first
Lecture 15, slide 20
SAGA - iterative refinement of Progressive Alignment
Keep improving the alignment until no improvement is possible - take out one sequence from the beginning and realign it
T-Coffee - refinement of Progressive Alignment
Consistency step added - for each starting pairwise alignment, aligns a third sequence (next closest) and then chooses the best pair from the three to move forward
Slower but good (combats greediness) for distantly related protein sequences/motifs
MUSCLE - refinement of Progressive Alignment
Build quick approximate relationship guide trees with just short hits (short gapless matches) between pairs of sequences - rather than full pairwise alignment
Compute an MSA from the tree, derive pairwise distances from that MSA, and build a new tree
Recompute the MSA using the new tree
Refine the alignment by iteratively splitting the tree into 2 groups and realigning the multiple alignments from the 2 groups. Repeat this step until no further improvement in alignment score
Lecture 15, slide 26
Faster and combats greediness
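A minimal sketch of an alignment-free distance based on short gapless matches, as used for the quick first guide tree. The formula (1 minus the fraction of shared k-mers, relative to the shorter sequence) and k = 3 are assumptions for illustration and are not claimed to be the exact measure MUSCLE uses. Sequences are assumed to be at least k long.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """1 - (shared k-mers / number of k-mers in the shorter sequence)."""
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    shortest = min(len(a), len(b)) - k + 1
    return 1.0 - shared / shortest


print(kmer_distance("ACGTAC", "ACGTTT"))
```

Counting k-mers takes time linear in sequence length, so filling the distance matrix avoids running DPA on every pair.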
Hidden Markov Models (HMMs) for distant protein sequences
MAUVE - Multiple Bacterial Genome Alignment
https://www.ncbi.nlm.nih.gov/CBBresearch/Przytycka/download/lectures/PCB_Lect05_Multip_Align.pdf
Tool: Snippy-core for core genome alignment