Algorithms for functional genomics
Mapping and alignment
alignment measures similarity, but be careful of "repeat regions"
usage: similarity -> common ancestors / function
alignment is a first step in many kinds of analysis, and is a part of many other algorithms.
database searching
Phylogeny - Given a set of related sequences, infer the evolutionary relationship between them, or evolutionary "time" given a mutation rate
De novo sequence assembly - reconstruct a single large sequence from many small pieces of that sequence (DNA, RNA)
Read Mapping - Given a reference sequence, align many small reads to it
○ infer variants
○ quantify mRNA (RNA-seq)
Protein function - find protein sequence regions which are similar to proteins with known function, to infer similar structure and function
Edit distance:
Hamming distance: substitutions only
√ Levenshtein distance: substitutions or indels
naive recursive computation: O(3^N)
Global Alignment:
full length of each sequence
Needleman-Wunsch Algorithm
time/space complexity: O(n×m)
A Dynamic Programming Algorithm (DPA)
banded alignment: O(n×d)
Keep track of which cell(s) was/were the origin of the best score(s) for each given cell, and use this information for traceback. This increases storage requirements by a constant factor (i.e. they are still O(N^2)).
Or, during traceback, calculate which cells(s) could have been the origin of the best score(s) for each cell. This increases the computational cost of traceback by a constant factor (i.e. it is still O(N)).
when: whole sequences are expected to be related
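A minimal sketch of the Needleman-Wunsch dynamic program (score only; traceback omitted). The unit match/mismatch/indel scores are illustrative assumptions, not values from the notes:

```python
# Global alignment score by dynamic programming: O(n×m) time and space.
def needleman_wunsch(a, b, match=1, mismatch=-1, indel=-1):
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] against b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * indel          # a[:i] aligned against all gaps
    for j in range(1, m + 1):
        F[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + indel, F[i][j - 1] + indel)
    return F[n][m]  # tracing back from this cell recovers the alignment

print(needleman_wunsch("GATTACA", "GCATGCU"))  # 0 with these scores
```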
local alignment:
Smith-Waterman-Gotoh Algorithm
when: only parts of the sequences are expected to be related
used to find similarity between a short query sequence and a large database of sequences
traceback from the highest-scoring cell only, so only the best-scoring segment is aligned
O(n×m)
A Dynamic Programming Algorithm (DPA)
Faster tool:
BLAST: many longs vs. few shorts
fast but less sensitive than full local alignment
Fast lookup to find high-scoring matches as "seeds"
Extend the match in both directions to find the best-scoring alignment around that "seed"
Scoring variations: decides the "best" alignment
generalise the score for each pair
Affine gap costs
rather than a simple per-base indel cost
it is more likely to insert (or delete) multiple bases at once than a single base multiple times
an indel of length k has cost
w(k) = h + g×k
where h is the cost to open a gap
and g is the cost to extend a gap
used in BLAST, both local and global can use
O(n×m)
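A tiny illustration of the affine gap formula; h = 4 and g = 1 are made-up values:

```python
# Affine gap cost w(k) = h + g*k: one gap-open penalty plus k extensions.
def gap_cost(k, h=4, g=1):
    return h + g * k

# One 3-base indel (cost 7) is cheaper than three separate 1-base indels
# (cost 15), matching the intuition that a single longer indel event is
# more likely than several independent ones.
print(gap_cost(3), 3 * gap_cost(1))  # 7 15
```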
Semi-Global alignment & read mapping
NGS[next generation sequencing] read alignment
find where all the (short) reads/query fit on our reference genome, quickly and accurately.
Inexact alignment:
○ errors in the read
○ true differences between sample & reference
scoring alignments: e.g. require (number of matches / length of read) > 80%
Semi-Global alignment(glocal)
Match whole of the [short]query against part of [long]reference
full dynamic-programming alignment against the entire reference genome is too costly
Seed and Vote Alignment Strategy
rely on the relatively low error rates in read data
steps
If most of the read sequence is correct and we do not expect too much genetic variation between reads and reference,
Then we can reliably find subsequences within reads that match the reference perfectly.
Use an index to the reference to quickly query “seeds” from the read.
Align reads to seeded regions of the reference only.
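A toy sketch of the seed-and-vote idea (hypothetical helper names, not a real aligner): each exact k-mer hit votes for the read start position it implies, and full alignment is then done only around the winning position:

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def vote_for_position(read, index, k):
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            # A seed at read offset `offset` implies the read starts at
            # reference position ref_pos - offset.
            votes[ref_pos - offset] += 1
    return max(votes, key=votes.get) if votes else None

ref = "ACGTACGTGGTCACGTTTACGGAACGTACGT"
idx = build_index(ref, k=5)
print(vote_for_position("GTCACGTTTAC", idx, k=5))  # 9
```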
read mapping
Two common sources of error
○ Sequencing error. < 1% on current hardware
○ Incorrect reference.
■ Poor reference assembly
■ Reference is for a related but different organism
■ Same organism but individuals are different!
■ Cancer vs normal tissue
many shorts vs. few longs
Tools: BWA...
ambiguous alignment: one short read can fit on several places of the reference
Multiple-mapping reads
● Align to all possible places
○ useful in some situations
○ but the read belongs to only one "real" place
● Align to the first place you find
○ not a good idea... but some tools still do it
● Align to a random choice of all valid places
○ useful in some situations
● Don't use multiple-mapping reads
○ often necessary if calling SNPs
data format
FASTQ
SAM/BAM
End-gap free alignment (another name for semi-global alignment)
De novo Genome assembly
Reconstruct the original genome from the sequence reads only
why
●Produce a reference for new species
● A wide variety of molecular biological tools become available or more effective
● Coding sequences can be studied in the context of their related non-coding (eg regulatory) sequences
● High level genome structure (number, arrangement of genes and repeats) can be studied
● Genomic variation can be studied by comparing against the reference
Genome -> short fragments/short sequences/reads -> sequencing[stored in FASTQ]
FASTQ:
Encoded quality values, one symbol per nucleotide
Phred Qualities: formalization
quality scores/value for each nucleotide:
the measurement of reliability for each base
derived from the physical process
quality encoding
letters/symbols to represent numbers
Q0 -> Q40: bad -> excellent
Q10 -> 1 error in 10 -> 90% accuracy (bad)
Q20 -> 1 error in 100 -> 99% accuracy (ok)
Q30 / Q40 -> 99.9% / 99.99% accuracy (very good / excellent)
P = 10 ^ (-Q/10)
Q= Phred quality score
P = probability of base call being incorrect
accuracy = 1-P
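A small sketch decoding Phred+33-encoded quality characters (the usual encoding in modern FASTQ files), using P = 10^(-Q/10):

```python
def phred_scores(quality_string, offset=33):
    # Each symbol encodes Q as (ASCII code - offset).
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    return 10 ** (-q / 10)  # P = 10^(-Q/10)

quals = phred_scores("II5?+")
print(quals)                                  # [40, 40, 20, 30, 10]
print([error_probability(q) for q in quals])  # 0.0001 ... 0.1
```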
steps
• Build an assembly graph – a picture of read connections
assembly graphs
Directed graph
● Nodes (vertices): Reads
● Edges: Overlaps
● Paths: Connected reads → Contigs!
Hamiltonian path: visit every node exactly once (cf. the travelling salesman problem)
NP-hard
• Find all overlaps between reads – hmm, sounds like a lot of work…
data amount
coverage: aka breadth
depth
○ sequencing errors
○ polymorphisms
Random reads
Coverage = 1 - e^(-Depth) (for uniformly random reads)
~100% coverage may still mean some regions are supported by only a single read
in practice, real data falls short of this idealised estimate (e.g. ~50%)
Greedy Extension
Compute all overlaps: O(N^2) × O(L^2)
Find the longest overlap and join the two reads
Repeat step 2 until there are no more overlaps.
Issues:
● Makes locally optimal decisions which may not be globally optimal (the chosen overlap may really belong to a different repeat copy, so the join may not be at the correct position)
● Computationally expensive
Sequencing by Hybridisation (SBH)
Sequencing method:
○ Put all possible k-mers on a chip (k=20-25)
○ Chop up target DNA and add fluorophores
○ Hybridise DNA to chip
○ Measure fluorescence on each spot on chip
○ Get list of k-mers present in target DNA.
problems
● Paths can be ambiguous due to low k values
● k-mers for k=5 are not that unique; uniqueness comes at higher values of k.
● Number of probes (spots on a chip) required is 4^k
● Cross-hybridisation is a problem (e.g. G/C-rich probes tend to stick to everything else, producing false fluorescence signals)
● Not all k-mers hybridise equally
K-mer de Bruijn Graph Assembly
● A directed graph
● Nodes in the graph are (k-1)-mers (the overlaps)
● Edges represent k-mers (which overlap by k-1 symbols)
balanced/unbalanced nodes: compare incoming and outgoing edges
Graph traversal[ “Eulerian Walk”]:
Start at an unbalanced node with Degree ≠ 0
Walk through the graph, adding edges
Stop when you hit another unbalanced node
A directed, connected graph is Eulerian if and only if it has at most 2 semi-balanced nodes and all other nodes are balanced.
A graph is connected if each node can be reached from some other node.
An Eulerian walk visits each edge exactly once.
Not all graphs have Eulerian walks.
Why
why K-mers
● Not all reads will overlap another read
○ Read errors
○ Coverage “holes”
● Not all reads are the same length (depending on technology)
no alignment needed -> no expensive overlap computation!
min/max overlap = 4
● To be able to use de Bruijn graphs, we need reads of length L to overlap by L-1 bases.
Why De Bruijn
fast
● Make k-mers in O(Nr) time
● Eulerian pathing is fast!
○ Can make contigs in O(E) time. (E is # of edges)
● Using de Bruijn graphs is fast!
○ No alignments
○ Can “hash” k-mers to build graph
○ Lookups are fast
○ Can build graph in O(N) time N-total length of reads【for high average coverage and perfect data G<< N, O(min(N,G))】
Hash Table
steps
Select a value for k
"Hash" the reads (make the kmers)
Count the kmers
Make the de Bruijn graph
Perform graph simplification steps - use cov cutoff
Read off contigs from simplified graph
Make the de Bruijn graph
choose k ≤ shortest read length; generate the unique k-mer list, then:
Take the first k-mer
Split it into a k-1 prefix and k-1 suffix
Add both to the graph as nodes (if not already present)
Add a directed edge to the graph connecting them
Label this edge as the k-mer
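A minimal sketch of these construction steps, with (k-1)-mer nodes and k-mer edges (a multigraph, so a repeated k-mer adds a repeated edge):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(list)  # (k-1)-mer node -> successor nodes
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            graph[prefix].append(suffix)  # edge labelled by this k-mer
    return graph

for node, successors in de_bruijn(["ACGTC", "CGTCA"], k=3).items():
    print(node, "->", successors)
# AC -> CG -> GT -> TC -> CA: an Eulerian walk spells out ACGTCA
```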
problems
repeats problem
Larger k value has caused disconnections in the graph!
but fewer loops/contigs
● Lower k
○ More connections
○ Less chance of resolving small repeats
○ Higher k-mer coverage
● Higher k
○ Less connections
○ More chance of resolving small repeats
○ Lower k-mer coverage
● Optimum value for k will balance these effects
if each node may only be visited once, the walk breaks into several contigs
read errors
errors create forks in the graph
solution: more coverage
• Errors won’t be duplicated in every read
• Most reads will be error free
• We can count frequency of k-mers
• Use frequency to clean the de Bruijn graph
Coverage Cutoff - parameter (e.g. k-mer coverage 99 vs 1 suggests a sequencing error; 50 vs 50 suggests a diploid variant)
C_k = C × (L - k + 1) / L
C_k = k-mer coverage depth
C = coverage depth
L = read length
(e.g. L = 100, k = 31, C = 50 gives C_k = 35)
Genomes Assembly
double stranded DNA
bottom strand and top strand are reverse complement
When we add them to the graph, we add both the forwards and reverse complement of the prefix/suffix.
Take the first k-mer
Split into prefix/suffix
Add them as nodes
Add the revcom to nodes
Add kmers (forwards and reverse complement) as edges
• Simplify the graph – sequencing errors will mess it up a lot
• Squash small bubbles – collapse small errors (or minor heterozygosity)
• Remove spurs – short “dead end” hairs on the graph (if nothing else matches a spur it is probably an error, and can be trimmed off)
• Join unambiguously connected nodes
– reliable stretches of unique DNA
remove contained strings.
A read fully contained within another longer read (“Not, look” in the lecture example) adds no information and can be removed.
• Traverse the graph – trace a sensible path to produce a consensus
• For each unconnected graph – at least one per replicon in original sample
• Find a path which visits each node once
– Hamiltonian path/cycle is NP-hard (this is bad)
– solution will be a set of paths which terminate at decision points
• Form consensus sequences from paths
– use all the overlap alignments
– each of these collapsed paths is a contig
contigs: Contiguous, unambiguous stretches of assembled DNA sequence
Ends:
Real ends (for linear DNA molecules)
Dead ends (missing sequence)
Decision points (forks in the road)
sequences (may contain errors) -> a few long sequences / contigs (based on overlaps)
Ideally, one sequence per replicon
if reads disagree -> take the majority consensus
if there is no clear majority, break the contig at that point
difficulties
long reads would not be a problem
Many(million to billion) pieces (read length is very short compared to the genome)
storing in the RAM is a challenge
Lots of overlaps - examining every pair of reads is O(N^2)
can be reduced towards O(N)
Lots of sky (short repeats)
TATATATATA
reads: TATA TATA TATA
Dirty pieces (sequencing errors)
Multiple copies (long repeats)
repeats
type: tandem / interspersed
solution: Long reads can span repeats
It is impossible to resolve repeats of length L unless you have reads longer than L.
Ploidy - haploid (e.g. bacteria; easiest to assemble), diploid and polyploid
slightly different cause bubbles in the assembly graph
Assessing assemblies
contiguity
desire: ○ Fewer contigs ○ Longer contigs
○ Number of contigs
○ Average contig length
○ Median contig length
○ Maximum contig length
○ “N50”, “NG50”, “D50”
N50: the contig length at which, summing contigs from longest to shortest, the running total first reaches ≥ 50% of the genome size
caution: assembly errors that wrongly merge contigs will inflate N50
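A short sketch of that definition; passing an estimated genome size instead of the assembly total would give NG50:

```python
def n50(contig_lengths, genome_size=None):
    total = genome_size or sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:   # first contig to reach half the total
            return length

print(n50([100, 80, 50, 30, 20, 10]))  # total 290 -> 100 + 80 >= 145 -> 80
```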
completeness: relative to the estimated genome size
[0,1]
Assembled Genome Size/ estimated Genome Size
estimated Genome size:
based on known, previously assembled genomes
Core Genes
thought to be present in a wide variety of organisms
Number of Core Genes in Assembly / Number of Core Genes in Database
correctness
Proportion of the assembly that is free from mistakes
Mis-joins
Repeat compressions - collapsed repeats make the assembly too short
Unnecessary duplications - make the assembly too long
Indels / SNPs caused by assembler
check for self consistency
● Align all the reads back to the contigs
● Look for inconsistencies
Summary:
○ sequencing bias under represents certain regions
○ Reads are short relative to genome size
○ Repeats create tangled hubs in the assembly graph
○ Sequencing errors cause detours and bubbles in the assembly graph
Phylogenetic trees
clustering methods
trees
root -> branch -> node -> tip
unrooted tree
Parsimony methods:
optimal tree[NP-hard] <- minimize the number of changes on the branches
Sankoff algorithm(DPA)
score a tree/ calculate parsimony
minimum cost at the root
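A minimal Sankoff sketch for one alignment column on a small binary tree; the unit substitution cost is an illustrative choice:

```python
import math

STATES = "ACGT"

def cost(a, b):
    return 0 if a == b else 1  # unit cost for any substitution

def sankoff(node):
    # node is ("leaf", state) or ("node", left_subtree, right_subtree);
    # returns {state: minimum cost of the subtree if this node has state}.
    if node[0] == "leaf":
        return {s: (0 if s == node[1] else math.inf) for s in STATES}
    left, right = sankoff(node[1]), sankoff(node[2])
    return {s: min(cost(s, t) + left[t] for t in STATES)
             + min(cost(s, t) + right[t] for t in STATES)
            for s in STATES}

tree = ("node", ("node", ("leaf", "A"), ("leaf", "C")), ("leaf", "A"))
print(min(sankoff(tree).values()))  # parsimony score of this tree: 1
```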
Combinatorics and heuristics:
find the best tree
rearrange trees by breaking and reattaching branches
efficient <- recursive Sankoff's algorithm
sequential addition
choose a few initial tips
add the fourth tip at each possible position and calculate the score, keeping the best (use shortcuts like not recomputing unchanged nodes)
input: sequences / character data
output: an optimal but not necessarily the right tree
changes include the changed character and the position
branch length = the number of changes along that branch
Outgroup: a distantly related taxon used to root the tree (e.g. the lineage most diverged from the Old World monkeys and apes)
Distance methods
UPGMA [unweighted pair group method with arithmetic mean] algorithm
"unweighted" = every tip contributes equally to the cluster averages
pro: rooted tree
cons: bad assumption
Assumption: all branches have the same evolutionary rate - a molecular clock
often a really bad assumption in real life, but it works well for closely related species or genes
Ultrametric trees: all root-to-tip path lengths are equal
Resulting tree is not guaranteed to be optimal but it is efficiently computable
O(n^3) unoptimized -> O(n^2) optimized
Neighbor Joining algorithm
pro: no molecular-clock assumption
con: produces an unrooted tree
watch out for negative branch lengths
Guaranteed to recover the correct tree if the distances exactly reflect a tree
O(n^3) unoptimized -> O(n^2) optimized
K-mers
Alignment-free sequence comparison
when:
Species identification/detecting contamination in sequencing experiments
de novo assembly
Repeat and motif detection
Why: Usually much faster than alignment but much less sensitive
How: count the fraction of unique k-mers (k-mers not shared with the other sequences)
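One concrete version of this idea (the exact measure varies by tool): the Jaccard distance between the k-mer sets of two sequences:

```python
def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_distance(a, b, k=4):
    ka, kb = kmers(a, k), kmers(b, k)
    # 0 = identical k-mer content, 1 = no shared k-mers
    return 1 - len(ka & kb) / len(ka | kb)

print(kmer_distance("ACGTACGTAC", "ACGTACGTTT"))  # 1 - 4/6 ≈ 0.33
```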
goal: find a good[not optimal] tree by combining closely related tips
input: a distance between each gene/species/sequence at the tips/ distance matrix:
alignment scores
k-mer distances
immunological distances (larger = more distantly related; smaller = more closely related)
output
may not be the right tree either, but it is good enough and faster. The right tree tells you what happened over the course of evolution.
Newick file format
Clustering
not require labels
Good for discovering patterns;
human-aided interpretation
Hierarchical clustering -
agglomerative algorithm: heatmap
gives all possible numbers of clusters[for given distance metric and linkage method]. -> slice the tree at what level you're interested
deterministic -> the outcome is fixed by how we define the distance between clusters (for a given distance matrix)
time complexity: O(n^2) n- total objects number
Can never undo a previous merge - i.e. very sensitive to input order; can't add a new point later
distance matrix/ dissimilarity matrix
Given a set of N items to be clustered, and an N×N distance (or similarity) matrix:
1. Start with N clusters (each containing one item).
2. Merge the two clusters that are closest.
3. Recompute the distances between clusters (only the distances from the new cluster to the others are needed).
4. Repeat steps 2 and 3 until all items are in one cluster, then choose a cutoff threshold (see the sketch below).
Single linkage: min d(a,b)
Complete linkage: max d(a,b)
Average linkage: UPGMA / WPGMA
Ward linkage: the merge that optimises the objective function
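A minimal sketch with SciPy: agglomerative clustering with average linkage, then cutting the dendrogram at a chosen height (the data and cutoff are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
Z = linkage(points, method="average")  # also: "single", "complete", "ward"
labels = fcluster(Z, t=3.0, criterion="distance")  # cut the tree at height 3
print(labels)  # e.g. [1 1 2 2 3]: three clusters at this cutoff
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree
```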
Dendrograms: tree diagrams showing the hierarchical relationship between clustered items
e.g. phylogenetic relationship between samples
heatmaps
normalise rows
Loses absolute expression information. But highlights relative expression of each gene across samples.
when:
● simultaneously seeing clustering of genes, and of samples
● visualising how these clusters correspond to one another
● seeing which genes are upregulated vs downregulated in which groups
Distances and dissimilarities
matrix/function:
Euclidean/ Manhattan
edit distance: Hamming/ Levenshtein/ Longest common substring (LCS) distance: single-letter insertions or deletions
k-mer distance
K-means clustering
k: meta-parameter
sensitive to outliers
clusters must be convex - cannot handle arbitrary shapes
minimise the objective/cost function - represents the within-cluster variance
objective function optimisation
walking downhill <- guaranteed to converge
If the objective function is convex, then an algorithm that decreases it at each step must eventually reach the global minimum.
many minima -> local minima, especially in high dimensions
trying all partitions is infeasible: there are O(K^N) of them
Lloyd's algorithm - good, simple and fast
O(N×K×d×i)
d = dimensionality
i = number of iterations
N items, K clusters
expectation-maximisation (EM) style algorithm: alternately vary the cluster centroids and the point assignments, holding the other fixed (see the sketch below)
1. Place K points into the space of the objects being clustered; these represent the initial centroids.
2. Assign each object to the group with the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.
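A bare-bones sketch of Lloyd's algorithm following steps 1-4 (no k-means++ initialisation, restarts, or empty-cluster handling):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    for _ in range(iters):
        # step 2: assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 3: move each centroid to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break  # step 4: stop once the centroids no longer move
        centroids = new
    return labels, centroids

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
print(kmeans(X, k=2)[0])  # e.g. [0 0 1 1]
```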
Not guaranteed to converge to the optimal solution, but efficient.
Guaranteed to converge - the objective function decreases at every step.
solutions may differ based on different centroid initializations
problem
poor initialisation of the centroids-> slow convergence/ convergence to local minimum
K-means++: choose initial centroids points which are far from centroids chosen so far
whether global or local minimum
run k-means multiple times [with different initialisations] to check for stable convergence
choose K
larger K always lowers the objective function, but arbitrarily large K is not meaningful
choose K at the "elbow", where the objective function stops decreasing sharply
○ may use validation to select meta-parameters like K
fails:
Clusters of different sizes -> k-means tends to make all clusters the same size
Clusters of different density
Non-convex/ non-globular/ non-spherical clusters/shapes
solution: simply increase K
K-medoids
instead of the mean of the points, take the most central data point
but same objective function as K-means
● Assign points to the nearest centroid as for k-means
● Update centroids by choosing the most central point. Commonly, choose a point in the cluster randomly and make it the centroid if it reduces the objective function.
● Less sensitive to outliers than k-means
(just as medians are less sensitive to outliers than means)
● Only need to know pairwise distances - don't need to compute distance to some arbitrary centroid.
● Slower to calculate - roughly O(kN^2) instead of O(kN)
K-medians: take the median in each dimension (the result may or may not be a data point); less sensitive to outliers than k-means
kernel K-means: can detect clusters that are non-convex in the original space, because the objective-function distances are computed in a transformed (kernel) space
parameters/ hyperparameters or meta-parameters
minimize intra-cluster distances
maximize inter-cluster distances
application: document search, image compression, gene expression,heatmaps
properties: exclusive/ overlapping/ probabilistic/ partitional/ hierarchical
Clusters validation
external validation: Compare the results of a cluster analysis to externally known class labels (ground truth)
Internal Validation: Evaluate how well the results of a cluster analysis fit the data without reference to external information
types: Density-based [uses point distances; finds arbitrarily-shaped clusters] / probabilistic / Network-based [a graph instead of distances; clusters are densely connected subgraphs]
Network analysis & graph theory
when
higher-order structure of data
phenomena appears due to interactions -> graph and networks needed
analyse experimental data
networks:
labelled graphs
G=(V,E)
V- vertices
E- edges
order - number of vertices (nodes/entities/points)
Edges: labelled / directed (arcs) / weighted / none of the above
size - number of edges
walk/path: starts and ends at a vertex; may contain cycles or be acyclic
graph representation
• Objects – explicitly as a set of vertices and edges
• Adjacency matrix
• Adjacency lists: reduce storage spaces
Sparse graphs are graphs with few edges; these fill adjacency matrices with huge numbers of zeros (see the sketch below)
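A small sketch contrasting the two representations on a toy directed graph:

```python
vertices = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("C", "D")]

index = {v: i for i, v in enumerate(vertices)}
matrix = [[0] * len(vertices) for _ in vertices]  # mostly zeros if sparse
adj = {v: [] for v in vertices}                   # compact for sparse graphs
for u, v in edges:
    matrix[index[u]][index[v]] = 1
    adj[u].append(v)

print(adj)                                 # {'A': ['B', 'C'], 'B': [], ...}
print({v: len(adj[v]) for v in vertices})  # outdegree of each vertex
```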
Graph topology
concepts
degree: the number of edges incident on the vertex
outdegree: the number of outbound edges or arcs
indegree: the number of inbound edges or arcs
connectivity: the minimum number of edge or vertex deletions required to disconnect that element from the remaining nodes
edge/node connectivity
subgraph/clique/k-clique[order k]
neighbourhood
local clustering coefficient
network average clustering coefficient
cancer genes: hub genes are important failure points in cancers
degree distribution
scale free graph: degree distribution follows a power law
Growth:
Preferential attachment
Scale-free organisation provides mechanisms for biological
robustness and error tolerance
many biological networks are scale-free
degree distribution with probability
random graph
small-world graph: vertices connect via short paths; built by rewiring edges with some probability
heterogeneity/ randomness/ modularity
how to map biological problems onto graphs
Weighted Gene Coexpression Network Analysis- WGCNA
motivation
Detect sets of genes with shared expression patterns; this allows us to find groups of functionally related genes.
Steps
Define a gene co-expression similarity
unsigned Pearson correlation -> similarity matrix
we are only interested in co-expression/co-regulation when forming modules, so the type of co-expression (activation vs inhibition) is irrelevant
define a Family of adjacency functions
decide whether each pair of genes is co-expressed based on adjacency function
hard thresholding: binary adjacency
soft thresholding: raise the similarity to a power, with diagonal = 0 (see the sketch below)
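A rough numpy sketch of an unsigned soft-threshold adjacency (beta = 6 is a placeholder; WGCNA itself typically goes on to use topological overlap rather than plain 1 - similarity):

```python
import numpy as np

def soft_threshold_adjacency(expr, beta=6):
    # expr: genes × samples expression matrix
    similarity = np.abs(np.corrcoef(expr))  # unsigned Pearson correlation
    adjacency = similarity ** beta          # soft thresholding
    np.fill_diagonal(adjacency, 0)          # no self-connections
    return adjacency

expr = np.random.default_rng(0).normal(size=(5, 20))  # 5 genes, 20 samples
dissimilarity = 1 - soft_threshold_adjacency(expr)    # input to clustering
print(dissimilarity.shape)  # (5, 5)
```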
determine the AF[adjacency function] parameters
want a scale free network conditions:
• that yields a scale-free like topology of the resulting network
• that prevents the resulting network from having few connections
use the smallest parameter at which the scale-free fit saturates above 0.8
define a measure of node dissimilarity
build dissimilarity matrix based on the similarity function
dissimilarity = 1 - similarity score
identify network modules/clustering
module: groups of nodes where expression profiles are highly correlated; high topological overlap
cluster the entries in the dissimilarity matrix
commonly hierarchical clustering, then cut at the chosen height
6. relate network concepts to each other
Each module has a characteristic eigengene – an expression profile that is ‘typical’ of the module
• Perform PCA on the module’s expression, eigengene = first PC
• Merge modules if their eigengenes are similar
relate concepts to external gene or sample info
• Compute network statistics within the modules,
then test for association with experimental outcomes
DNA is made from four possible nucleotides/bases (ACGT): complementary / double-stranded -> stable information storage, easy to replicate
chromosomes
protein: chain of amino acids
codon: mRNA needs at least three bases to encode one amino acid (4^3 = 64 combinations)
Fluorescence
mitochondrial
concatenate: multi-FASTQ
replicon
DNA-> transcription -> RNA -> translation -> Protein
DNA-sequencing
● Identification: ○ De novo sequencing ○ Exome capture sequencing
● Quantification: ○ RNA-seq ○ ChIP-seq ○ Barcode/tag counting experiments
cut DNA into small pieces -> make many copies -> read the sequences -> interpret the short sequenced fragments ("reads")
FASTQ files
start symbol -> sequence ID -> sequence -> separator line -> encoded quality values, one symbol per nucleotide
● Assembly: try to reconstruct the original sequence
● Alignment: try to match reads to a known reference genome
High-throughput alignment, indexing
Goal: fast and feasible alignment of many strings to a reference genome(its subsequences can be found fast through indexing which is related to compression)
compression -> smaller memory footprint
High-throughput alignment
full DP alignment is O(m×L×N): m = read length, L = reference length, N = number of reads
thus use high-throughput aligners
use a fast method to identify likely alignment sites:
small near-exact-match substrings (seeds)
-> diagonal stretches in the alignment grid
Pro: near-exact matching allows much faster algorithms
Con: may miss some good alignments if there is no near-exact seed match
seed-and-extend
find seed sites
index: positions start at 0
fast lookup of seed strings in the reference genome
the reference genome index can be pre-computed and re-used
hash tables
quickly find exact matches
store k-mers: O(k) lookup - does not grow with genome size
hash function loses information -> can't easily recover the original
collisions: different keys can hash to the same slot
suffix indexing
suffix tree
naive construction takes O(m^2) time (linear-time algorithms exist)
suffix tries
contain all suffixes of T: m(m+1)/2 characters in total
e.g. T = a^m: m "a" nodes + (m+1) "$" leaves (plus root) = 2m+2 nodes
the number of suffix-trie nodes grows between O(m) and O(m^2)
suffix arrays: smaller but less informative
T is part of index; store the sorted indices
space: O(m)
lexicographic (alphabetical) order of T’s suffixes
substring must be a prefix of ≥1 of T’s suffixes
Suffixes sharing a prefix are consecutive in the suffix array -> binary search
Constant factor greatly reduced compared to suffix tree
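A naive sketch: sort the suffix start positions, then binary-search for a pattern; all matches form one consecutive block of the array:

```python
def suffix_array(t):
    t += "$"
    return sorted(range(len(t)), key=lambda i: t[i:])  # naive O(m^2 log m)

def find(t, sa, p):
    t += "$"
    lo, hi = 0, len(sa)
    while lo < hi:  # leftmost suffix whose prefix is >= p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and t[sa[lo]:sa[lo] + len(p)] == p:
        hits.append(sa[lo])  # matching suffixes are consecutive
        lo += 1
    return hits

sa = suffix_array("abaaba")
print(sa)                         # [6, 5, 2, 3, 0, 4, 1]
print(find("abaaba", sa, "aba"))  # [3, 0]
```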
FM index and the Burrows-Wheeler transform(BWT)
compressible: built from sorted rotations of T, which groups similar characters into runs
reversible: T can be recovered from the last column (the first column is implied by sorting)
build index: O(M) time/space is acceptable
T - text/genome; length M
P - pattern (read or part of a read) to find in T; length N
∑ - alphabet {A,C,G,T}
querying the index: O(1) ~ O(M) query
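A toy BWT built from sorted rotations (a real FM index derives the BWT from the suffix array rather than materialising all M rotations):

```python
def bwt(t):
    t += "$"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)  # last column

print(bwt("abaaba"))  # 'abba$aa': equal symbols cluster into runs,
                      # which is what makes the transform compressible
```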
seeds are small subsequences with few or zero mismatches to the reference.
Seed feasibility
find matching seeds for (most of) our true read alignment positions
for read length N with up to E errors:
splitting the read into E+1 partitions guarantees at least one error-free partition (pigeonhole)
so the seed size = N/(E+1), rounded down
problem: must check all the partitions of the read to find the error-free one
number of error-free k-mers: t ≥ N - k + 1 - kE
so k ≤ (N + 1 - t)/(E + 1)
with t = 1, seed size = N/(E+1) as above
● only perform the slower full alignment algorithm at these sites
● pick the best-scoring alignment(s)
dimensionality reduction
create new features instead of using existing ones
unsupervised
Why: Data visualization,Statistical analysis
● PCA (principal component analysis)
redundant dimensions: most of the variance lies along certain dimensions
Goal: Find a K-dimensional(new space) set of basis vectors such that, if we project {xi} into this subspace, as much as possible of the variance is retained[most accurately represent given data].
minimise the orthogonal projection error
equivalently, looks for the maximum-variance projection (may show clustering, but cannot prove that clusters exist)
principal components:
ordered by size (variance explained);
uncorrelated with each other
how
compute the covariance between each pair of dimensions -> take the eigenvectors of the covariance matrix [the basis vectors] -> choose the top K PCs -> project the data points (see the sketch below)
reversible: the basis vectors can (approximately) reconstruct the original data
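A compact sketch of that recipe using an eigendecomposition of the covariance matrix:

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                 # centre each dimension
    cov = np.cov(Xc, rowvar=False)          # covariance between dimensions
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]       # PCs ordered by variance
    basis = eigvecs[:, order[:k]]           # top-k basis vectors
    return Xc @ basis, basis                # projection + components

X = np.random.default_rng(0).normal(size=(100, 5))
X[:, 1] = 2 * X[:, 0]                       # make one dimension redundant
projected, basis = pca(X, k=2)
print(projected.shape)  # (100, 2)
```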
application
visualization: 2-3 PC
compression: depends
batch effects removal- correct data
MDS (multi-dimensional scaling)
a descriptive technique, for visualising data
principal coordinates analysis
when
visualisation, discover data structure, outliers,interpolation
goal
embed data points in a lower-dimensional space.
Preserve original pairwise distances or dissimilarities(matrix).
how
objective function that measures the distortion of original to modified distances.-> minimize the objective function
stress/strain metric: measures how well the lower-dimensional distances represent the original ones (the distortion error)
type: classical/metric[distance-based]/non-metric/generalised(from metric MDS) MDS
simplest: Gradient descent on the objective function(stress)
interpretation
● Exact position of points is not useful/meaningful
● Clusters often represent real clusters
● Outliers often represent real outliers
● Ordering along an axis may represent a real ordering
● The linear combination of variables which make up a principal component may be interesting
● t-SNE (t-distributed stochastic neighbor embedding)
Distances in the plot are not real distances (unlike PCA)
The dimensions have no interpretable meaning (unlike PCA)
Hyperparameters need to be tuned carefully and they can have a huge impact on the result
t-SNE and UMAP are very powerful for visualization and data exploration but more than one look at the data is needed
cluster sizes in the plot mean nothing
based on probability distributions
goal: keep points that are close to each other in the original space close in the low-D space -> preserve local structure / small pairwise distances
does not tend to preserve global structure
distances between clusters have no meaning, but which points group together does
can exaggerate clusters present in the original dataset - be careful of over-interpretation
● UMAP (uniform manifold approximation and projection)
based on manifolds and topology rather than probability distributions.
UMAP has so far been more extensible because of the mathematical theory underpinning it
UMAP can use labels to help with dimension reduction
UMAP has a very fast Python implementation using Numba
UMAP is an order of magnitude or more faster than t-SNE
algorithms: advantages and limitations
why, when and how to apply them