Alignment Part 2
Interpreting BLAST output
% identity
Number of identical residues divided by the number of matched residues (ignoring gaps)
Positives
Gives measure of the fraction of residues that are either identical or similar (represented by a +)
Gaps
Shows residues that were not aligned
Length
Length of alignment
Query sequence
Submitted sequence
May contain XXXX regions, inserted automatically by BLAST to mask low-complexity segments (regions containing many identical residues)
Subject (sbjct) sequence
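The length, gaps and % identity fields above can be sketched from a pair of aligned strings. This is a minimal illustration, assuming gaps are written as "-" and using the % identity definition given in these notes (gap columns left out of the denominator); BLAST's own report may use the full alignment length instead. The function name `alignment_stats` and the example seqs are made up for illustration. Positives are not computed here, since they need a similarity matrix.

```python
def alignment_stats(query: str, subject: str) -> dict:
    """Summarise a pairwise alignment given as two equal-length gapped strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identical = matched = gaps = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            gaps += 1            # column where one residue has no partner
        else:
            matched += 1         # column pairing two residues
            if q == s:
                identical += 1   # the paired residues are the same
    percent_identity = 100.0 * identical / matched if matched else 0.0
    return {
        "length": len(query),    # length of the alignment, gaps included
        "gaps": gaps,
        "identical": identical,
        "percent_identity": percent_identity,
    }


# Hypothetical 9-column alignment: 1 gap, 7 identical residues out of 8 matched.
stats = alignment_stats("MKT-AYIAK", "MKTQAYLAK")
print(stats)
```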
Comparing Sequences Against a Database
To discover similarities (most used and powerful tool in Bioinformatics)
In essence it's the same as comparing 2 seqs (pairwise alignment) but repeated several thousand times (time-consuming and computationally demanding)
BLAST (Basic Local Alignment Search Tool)
Fastest + most widely used heuristic tool for pairwise sequence comparison
1st Stage: Basic heuristic premise is that a good alignment of 2 seqs will contain a high-scoring pair of aligned words, where a word is typically 3 residues for proteins (nucleotide searches typically use longer words, e.g. 11 in BLASTN)
Program first scans the DB seqs for words that score at least a set threshold (T) when aligned with a word from the query seq (the seq of interest). Each such alignment is termed a hit
2nd Stage: Process has to determine whether or not each hit lies within a larger seq alignment
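The two stages can be sketched in a few lines. This is a toy illustration, not BLAST's actual implementation: it assumes a simple match/mismatch score in place of a BLOSUM or PAM matrix, and it extends a hit only while residues are identical, which is a stand-in for BLAST's real extension rule. The names `word_hits` and `extend_hit`, the default scores and the example seqs are all hypothetical.

```python
def word_hits(query, subject, w=3, T=15, match=5, mismatch=-4):
    """Stage 1: list (query_pos, subject_pos, score) for word pairs scoring >= T."""
    hits = []
    for i in range(len(query) - w + 1):
        for j in range(len(subject) - w + 1):
            pairs = zip(query[i:i + w], subject[j:j + w])
            score = sum(match if a == b else mismatch for a, b in pairs)
            if score >= T:
                hits.append((i, j, score))
    return hits


def extend_hit(query, subject, i, j, w=3):
    """Stage 2 (simplified): grow a hit left and right while residues are identical."""
    start_q, start_s = i, j
    end_q, end_s = i + w, j + w
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return query[start_q:end_q]


query, subject = "PQGEFG", "AAPQGEFGTT"
hits = word_hits(query, subject)              # with these toy scores, T=15 keeps exact 3-residue words
first_i, first_j, _ = hits[0]
segment = extend_hit(query, subject, first_i, first_j)
print(hits, segment)
```

Lowering `T` (for example to 6 with these toy scores) also admits words with one mismatch, which is the role the threshold plays in the first stage.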
Available in various formats
BLASTP
Compares an AA query seq against a protein seq DB
BLASTN
Compares a nucleotide query seq against a nucleotide seq DB
BLASTX
Compares a nucleotide query seq translated in all reading frames against a protein seq DB
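"Translated in all reading frames" means three frames on the forward strand and three on the reverse complement. A minimal sketch of that step, assuming the standard genetic code and an unambiguous DNA query; the function names and the frame labels ("+1" to "-3") are illustrative choices, and codons containing unknown bases are shown as "X".

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; a trailing partial codon is ignored."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Return the protein translation for each of the six reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCC")
print(frames)
```

Each of the six protein strings would then be compared against the protein DB as in a protein search.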
Multiple Sequence Alignments
Fast heuristic (best guess) methods are most often used for multiple seq alignments
Most widely used approach is progressive alignment, on which the Clustal programs (V, W & X) are based
Uses dynamic programming to build a multiple alignment, starting w/ the closest seqs and progressively adding more distant seqs
2 step method
Seqs are first grouped by similarity to produce a guide tree (dendrogram)
Seqs are then progressively aligned according to the branching order in the guide tree
Closest seqs are first aligned 2 by 2, intermediate then aligned 2 by 2 and so on
Extremely fast, yet has 2 shortcomings
Seqs aligned at beginning are never realigned
Early mistakes cannot be corrected
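The first step of the 2 step method (grouping by similarity to get a branching order) can be sketched as below. This is a toy stand-in, not what Clustal does internally: it assumes equal-length seqs, uses the fraction of differing positions as the distance, and clusters by simple average linkage. The function names and example seqs are hypothetical. The returned merge order is the order in which a progressive aligner would align pairs and groups.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of positions that differ between two equal-length seqs (toy distance)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree_order(seqs: dict) -> list:
    """Return the merge order from average-linkage clustering of the seqs."""
    clusters = {name: (name,) for name in seqs}   # cluster label -> member names
    merges = []

    def avg_dist(pair):
        a, b = pair
        dists = [p_distance(seqs[x], seqs[y]) for x in clusters[a] for y in clusters[b]]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Closest pair of clusters is joined first; ties go to the first pair listed.
        a, b = min(combinations(sorted(clusters), 2), key=avg_dist)
        merges.append((a, b))
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
    return merges


seqs = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGTTCAA",
    "D": "TCGTTCAA",
}
order = guide_tree_order(seqs)
print(order)
```

Because each merge is final, the sketch also shows why early mistakes cannot be corrected: once A and B are joined, later steps only ever see them as one group.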
Several MSA resources exist, but 2 predominate
The Clustal Series (V,W + X)
Clustal V carries out profile alignments and generates trees using the fast Neighbour-Joining method
Clustal W followed; the W stands for weighting, indicating that the program can assign weights to seqs and program parameters (more the better)
Clustal X provides a windowed graphical interface and is easy to use; it produces alignments + quality analyses quickly, w/ a coloured graphical display used to highlight low-scoring regions
T-Coffee (and EXPRESSO)
Tree-based Consistency Objective Function for alignment Evaluation
Produces better alignments for seq similarities less than 30%
At each step of progressive alignment T-Coffee makes use of pairwise info between all pairs of seqs in the library, not just those being aligned at that stage
This makes T-Coffee less prone to the most common type of error in progressive alignment programs, in which a poor alignment early in the progression 'locks in' a particular error to the alignment
EXPRESSO
Latest development of T-Coffee
Uses BLAST to search the PDB for structures whose sequences are similar to query sequence
Slower than T-Coffee, but if it finds enough structures it provides the most accurate sequence alignments available today
Tests for Significance for Alignments
Scores
Raw score (S) of the alignment is calculated by summing the scores for each letter-to-letter and letter-to-null (gap) position in the alignment
Scores for each position of the alignment are derived from a substitution scoring matrix (SSM), usually BLOSUM or PAM
Bit score (S') accounts for the type of scoring system used and is therefore more informative, allowing diff alignments, even those employing diff scoring matrices, to be compared
The higher the score the better the alignment
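A rough sketch of both scores. The raw score below sums per-column scores using toy match, mismatch and gap values rather than real BLOSUM or PAM entries. The bit score uses the standard normalisation S' = (lambda * S - ln K) / ln 2, where lambda and K are statistical parameters of the scoring system; the values passed in the example (0.267 and 0.041) are illustrative assumptions only, and all function and parameter names are made up for this sketch.

```python
import math


def raw_score(query: str, subject: str, match=4, mismatch=-2, gap=-5) -> int:
    """Sum per-column scores over an alignment given as two gapped strings."""
    score = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            score += gap          # letter-to-null column
        elif q == s:
            score += match        # identical letters
        else:
            score += mismatch     # substituted letters
    return score


def bit_score(raw: float, lam: float, K: float) -> float:
    """Normalise a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(K)) / math.log(2)


S = raw_score("MKT-AYIAK", "MKTQAYLAK")        # 7 matches, 1 mismatch, 1 gap
S_bits = bit_score(S, lam=0.267, K=0.041)      # illustrative lambda and K
print(S, round(S_bits, 1))
```

Since lambda and K absorb the scale of the scoring system, two alignments scored with diff matrices can be compared on their bit scores even when their raw scores are not comparable.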
E (expect) value
Expected number of chance alignments w/ a score of S or better
Basically, how likely is it that the similarity between the query and database seqs occurred by chance?
The lower the E value the more significant the hit