Algorithms for functional genomics 2
Single-cell RNA-seq analysis
assessing cell-to-cell heterogeneity,
identifying new cell types/clusters,
differential expression
pre-processing scRNA-seq data
quality control
caution: activation/inhibition states can reduce transcription, so low counts are not always technical artefacts
Remove low-quality cells (subset by column)
Filter by library size
Filter by number of expressed genes in each cell
Examine Mitochondrial, Ribosomal or Spike-in proportions
Remove low-abundance genes (subset by row)
Filter by average expression level
Filter by expressed in at least n cells
Use a less aggressive approach for studies involving rare cells
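A minimal sketch of these QC filters with NumPy, assuming a genes × cells count matrix; the data and thresholds here are purely illustrative (real studies pick cutoffs from the QC distributions):

```python
import numpy as np

# Hypothetical genes x cells count matrix (rows = genes, columns = cells)
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(2000, 500))

library_size = counts.sum(axis=0)            # total counts per cell
genes_per_cell = (counts > 0).sum(axis=0)    # expressed genes per cell

# Remove low-quality cells: subset by column (thresholds illustrative)
keep_cells = (library_size >= 1000) & (genes_per_cell >= 200)
counts = counts[:, keep_cells]

# Remove low-abundance genes: subset by row (expressed in at least n cells)
keep_genes = (counts > 0).sum(axis=1) >= 3
counts = counts[keep_genes, :]
print(counts.shape)
```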
normalization & dimension reduction
Normalization: compute a scaling (size) factor per cell to correct for sequencing-depth imbalance
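A sketch of simple library-size normalization (size factor = library size relative to the median across cells); methods like scran use more robust pooled estimates, so treat this as the basic idea only:

```python
import numpy as np

def size_factor_normalize(counts):
    """Divide each cell (column) by its size factor: library size
    relative to the median library size across cells."""
    library_size = counts.sum(axis=0).astype(float)
    size_factors = library_size / np.median(library_size)
    return counts / size_factors

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(100, 20))          # toy genes x cells matrix
normed = np.log1p(size_factor_normalize(counts))   # log-transform for downstream steps
print(normed.shape)
```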
clustering analysis/ trajectory inference
read alignment & quantification
high-throughput sequencing -> gene expression in individual cells
barcodes/sequences
UMIs (Unique Molecular Identifiers)
modelling errors: UMIs seen only a few times are error-prone -> remove low-count UMIs (e.g. below a percentile threshold)
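A toy sketch of that percentile idea, with hypothetical read counts per UMI and an arbitrary cutoff:

```python
from collections import Counter
import numpy as np

# Hypothetical read counts per UMI at one gene/barcode
umi_reads = Counter({"ACGTACGT": 120, "TTGACCAG": 95,
                     "ACGTACGA": 2, "TTGACCAA": 1})

counts = np.array(list(umi_reads.values()))
threshold = np.percentile(counts, 50)  # hypothetical percentile cutoff

kept = {umi for umi, n in umi_reads.items() if n > threshold}
print(kept)  # {'ACGTACGT', 'TTGACCAG'} - rare, error-prone UMIs dropped
```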
FASTQ file -> BAM -> table of counts
High-resolution, high-dimensional and with high levels of noise (60-70% of entries are zero) - the count matrix is tall and thin
downstream analysis
clustering
group cells with similar expression profiles into clusters
differential expression (DE) analysis
identify genes whose expression changes between different groups of samples
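A compact sketch of both downstream steps with scikit-learn and SciPy: reduce dimensions with PCA, cluster cells with k-means, then test each gene between two clusters. Data and settings are illustrative; real pipelines use count-aware DE models (e.g. negative binomial) rather than a t-test:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from scipy import stats

rng = np.random.default_rng(1)
X = np.log1p(rng.poisson(1.0, size=(300, 2000)))   # cells x genes, log counts

pcs = PCA(n_components=10).fit_transform(X)        # dimension reduction
labels = KMeans(n_clusters=2, n_init=10).fit_predict(pcs)

# Per-gene DE between the two clusters (Welch's t-test as a placeholder)
t, p = stats.ttest_ind(X[labels == 0], X[labels == 1], equal_var=False)
print("genes with p < 0.01:", int((p < 0.01).sum()))
```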
Molecular evolution modelling
likelihood model
model sequence changes: parameters (e.g. tree length) determine the probabilities of change
Can compare probabilities to select the model that best fits the data
Markov process
memoryless steps: given the current state, earlier states no longer affect the next state
transition matrix: transition probabilities
independent steps -> each row of the transition matrix sums to 1 (for symmetric models such as Jukes-Cantor, columns do too)
P(Data₁..ₙ | Model) = ∏ᵢ P(Dataᵢ | Model): multiply the substitution probability at each base of the sequence
millions of characters x small probabilities -> underflow errors & inaccuracy
thus use logs: log P(Data₁..ₙ | Model) = ∑ᵢ log P(Dataᵢ | Model)
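A small demonstration of why the log is needed: multiplying a million per-site probabilities underflows to 0.0, while summing logs stays accurate (per-site probabilities here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
site_probs = rng.uniform(0.05, 0.25, size=1_000_000)  # P(Data_i | Model)

direct = np.prod(site_probs)          # underflows to exactly 0.0
log_lik = np.sum(np.log(site_probs))  # stable log-likelihood

print(direct)   # 0.0
print(log_lik)  # a large negative number (~ -1.9e6)
```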
application to trees
parameters
tree topology: consider one at a time
mutation rates: the number of changes
nucleotide frequencies: invariant/fixed, estimated from the data
divergence times: branch lengths
pruning trees
naive likelihood sums over all internal-state combinations -> number of terms: 4^n
codons: 64^n
independent subtrees: a subtree's likelihood does not depend on nodes above it -> can be computed bottom-up and "cut off" (pruning) -> number of terms linear in n
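A minimal sketch of Felsenstein's pruning algorithm for one site under Jukes-Cantor: each node's conditional likelihood vector is built bottom-up, so cost is linear in the number of nodes rather than 4^n (tree and branch lengths hypothetical):

```python
import numpy as np

STATES = "ACGT"

def jc_prob(t):
    """Jukes-Cantor transition matrix for branch length t
    (expected substitutions per site)."""
    same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
    return np.where(np.eye(4, dtype=bool), same, diff)

def conditional(node):
    """Conditional likelihood vector L[s] = P(data below node | state s)."""
    if isinstance(node, str):                 # leaf: observed base
        vec = np.zeros(4)
        vec[STATES.index(node)] = 1.0
        return vec
    (left, t_left), (right, t_right) = node   # internal node: two (child, branch) pairs
    lvec = jc_prob(t_left) @ conditional(left)
    rvec = jc_prob(t_right) @ conditional(right)
    return lvec * rvec                        # subtrees are independent

# ((A:0.1, C:0.2):0.05, G:0.3) written as nested (child, branch-length) pairs
tree = (((("A", 0.1), ("C", 0.2)), 0.05), ("G", 0.3))
site_likelihood = 0.25 * conditional(tree).sum()  # uniform root frequencies
print(site_likelihood)
```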
Bootstrapping: useful when hypotheses are hard to frame as nested models
• Resample the data, taking individual columns from the alignment with replacement
• Recalculate the full analysis with the new data: computationally expensive
• Generates an empirical distribution that represents the impact of sampling
• Assess the rank of your result in this distribution to generate a p-value
• Use ranks to generate confidence intervals of parameters
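A sketch of the column-resampling bootstrap described above; `analyse` is a placeholder for whatever statistic the full analysis produces, and `alignment` is assumed to be a sequences × sites array:

```python
import numpy as np

def bootstrap(alignment, analyse, n_boot=1000, seed=0):
    """Resample alignment columns with replacement, rerun the analysis,
    and return the empirical distribution of the statistic."""
    rng = np.random.default_rng(seed)
    n_sites = alignment.shape[1]
    stats = []
    for _ in range(n_boot):
        cols = rng.integers(0, n_sites, size=n_sites)  # with replacement
        stats.append(analyse(alignment[:, cols]))      # expensive: full re-analysis
    return np.array(stats)

# Ranks give a confidence interval, e.g. the central 95%:
# lo, hi = np.percentile(bootstrap(aln, analyse), [2.5, 97.5])
```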
fitting the models
large search space -> cannot search exhaustively
optimisation methods:
gradient descent/hill climbing: Newton-Raphson
simulated annealing
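A sketch of Newton-Raphson hill climbing on a single parameter: maximise the Jukes-Cantor log-likelihood of the distance d between two sequences of n sites with k differences (derivatives taken numerically; the closed-form JC distance is printed as a check):

```python
import numpy as np

def loglik(d, n=1000, k=180):
    """JC log-likelihood of distance d for n sites with k differences."""
    p_diff = 0.75 * (1.0 - np.exp(-4.0 * d / 3.0))
    return k * np.log(p_diff) + (n - k) * np.log(1.0 - p_diff)

def newton_raphson(f, x, h=1e-5, n_iter=20):
    """Maximise f by Newton-Raphson with numerical derivatives."""
    for _ in range(n_iter):
        d1 = (f(x + h) - f(x - h)) / (2 * h)          # first derivative
        d2 = (f(x + h) - 2 * f(x) + f(x - h)) / h**2  # second derivative
        x -= d1 / d2                                  # Newton step uphill
    return x

d_hat = newton_raphson(loglik, 0.1)
closed_form = -0.75 * np.log(1.0 - 4.0 * 180 / (3.0 * 1000))
print(d_hat, closed_form)  # both ~0.2058
```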
example models
Jukes-Cantor: a single rate, so the same transition probability for every change (A→C, A→G, ...)
• Additional biological realism comes at the cost of additional parameters
• Models can frequently be parameterised in different ways to implement the same underlying biological model
• Parameterisations may give identical results or may impact scaling of other parameters
• E.g. the same ratio of branch lengths in a tree but different absolute values
• Use variants with the fewest parameters for speed, but beware impacts on interpretation and comparisons
Kimura 2 Parameter: transitions (purine↔purine, pyrimidine↔pyrimidine) at rate α; transversions (purine↔pyrimidine) at rate β; only the ratio κ = α/β matters (set one rate to 1 and the other to κ, effectively one parameter)
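A sketch contrasting the two models via their rate matrices (using SciPy's matrix exponential): K2P puts transitions at rate κ and transversions at rate 1, and setting κ = 1 recovers Jukes-Cantor, which is exactly the nesting discussed below:

```python
import numpy as np
from scipy.linalg import expm

def k2p_rate_matrix(kappa):
    """K2P rate matrix over A, C, G, T: transitions (A<->G, C<->T)
    at rate kappa, transversions at rate 1 (unnormalised)."""
    Q = np.ones((4, 4))
    Q[0, 2] = Q[2, 0] = kappa            # A<->G transition
    Q[1, 3] = Q[3, 1] = kappa            # C<->T transition
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))  # rows sum to 0, so expm rows sum to 1
    return Q

P_k2p = expm(k2p_rate_matrix(2.0) * 0.1)  # transition probabilities at t = 0.1
P_jc = expm(k2p_rate_matrix(1.0) * 0.1)   # kappa = 1 collapses to Jukes-Cantor

print(P_k2p.round(3))  # transitions (A<->G, C<->T) more probable
print(P_jc.round(3))   # all off-diagonal entries equal: the JC structure
```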
General time reversible (GTR)
The previous models are special cases of this model
Empirical Models
protein sequences would require a large number of parameters
so use fixed (empirically estimated) probabilities
Use alignments of lots of proteins to estimate the rate at which changes are evolutionarily accepted
Multi-character models
use groups of an arbitrary number of sites
nested models
Models are considered nested when fixing a parameter in the more highly parameterised model gives the alternative model
In general, more highly parameterised models should only be used when they can be shown to improve the fit to the data
statistical hypothesis testing
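For nested models, the standard test is the likelihood ratio test: twice the log-likelihood improvement is compared to a chi-squared distribution with degrees of freedom equal to the number of extra parameters. A sketch with hypothetical fitted log-likelihoods:

```python
from scipy.stats import chi2

# Hypothetical maximised log-likelihoods from fitting both models
loglik_jc = -1520.4   # simpler model
loglik_k2p = -1512.9  # nested alternative with 1 extra parameter

lr_stat = 2 * (loglik_k2p - loglik_jc)  # likelihood ratio statistic
df = 1                                  # extra parameters in the larger model
p_value = chi2.sf(lr_stat, df)

print(lr_stat, p_value)  # small p: the extra parameter improves the fit
```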
Machine learning in biology and classification
linear and logistic regression
decision boundary
p = 0.5: logistic regression can only perfectly separate linearly separable classes
outliers: abnormal points that take on extreme values
logistic regression is not sensitive to outliers
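A scikit-learn sketch of the p = 0.5 decision boundary on hypothetical 2-D data; the boundary is the line where w·x + b = 0, which is why only linearly separable classes can be separated perfectly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)

# The decision boundary is where the predicted probability is 0.5,
# i.e. where w . x + b = 0 (a straight line in 2-D)
w, b = clf.coef_[0], clf.intercept_[0]
print(f"boundary: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
print("accuracy:", clf.score(X, y))
```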
machine learning
measuring performance
confusion matrix
imbalanced data: precision-recall curve
ROC curve (false positive rate vs sensitivity) best used when:
● Class labels are balanced
● TP, FP, TN and FN are known
PR curve best used when
● Class labels are imbalanced
● TN are not known
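A sketch computing both summaries with scikit-learn on hypothetical imbalanced data; the PR summary is the more informative one here because it ignores the huge pool of true negatives:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)   # imbalanced: ~5% positives
y_score = y_true * 0.5 + rng.random(1000) * 0.6  # noisy classifier scores

print("ROC AUC:", roc_auc_score(y_true, y_score))
print("average precision (PR):", average_precision_score(y_true, y_score))
# The PR summary is much lower, exposing the imbalance the ROC hides
```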
training procedures
training set/ test set
under-fitting:
high bias
under-parameterised (overly simple) model
Over-fitting (high variance)
trained too well on a specific dataset and does not generalise
● Biased training
● Small training sets
● Over-parameterised (overly complex) model
Solution:
● Get more data
○ Higher signal to noise ratio
○ Drown out fluctuations with the pattern we want to fit
● Use a less complex model
○ Model with only capacity to fit the pattern, not fluctuations
○ e.g. use fewer features - choose how many using validation
● Add explicit regularisation terms to model
○ Penalise higher-order polynomials, high weights, etc
○ Choose terms using validation
● Stop training earlier
○ If training is iterative, stop before we over-fit (use validation)
● Use noise to regularise
○ If training is iterative, add different random noise each iteration
noise drowns fluctuations
training (learn parameters)
validation
cross-validation
optimise the learning process: meta-parameters (hyperparameters)
testing (assess performance)
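A sketch of k-fold cross-validation for choosing a meta-parameter (here tree depth, on synthetic data), with a held-out test set used only once at the end to assess performance:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Validation: 5-fold CV on the training set only, to pick max_depth
scores = {d: cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                             X_tr, y_tr, cv=5).mean()
          for d in [2, 4, 8, None]}
best_depth = max(scores, key=scores.get)

# The untouched test set assesses generalisation exactly once
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_tr, y_tr)
print("chosen depth:", best_depth, "test accuracy:", final.score(X_te, y_te))
```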
Naive Bayes Classifier
SVM
margin = 1/|w|
find the maximum-margin hyperplane
maximise perpendicular distance
requires linearly separable data (hard margin)
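A sketch with a linear SVC on hypothetical separable data: a large C approximates the hard-margin SVM, and the margin 1/|w| is recoverable from the fitted weights:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (40, 2)), rng.normal(2, 0.5, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

# Large C approximates a hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 1.0 / np.linalg.norm(w)  # perpendicular distance to the nearest points
print("margin:", margin)
print("support vectors:", len(clf.support_vectors_))
```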
decision trees and random forests (forests can be hard to interpret)
surprisingly effective
implicit feature selection
parameter-free
both categorical and numeric data
robust to outliers
Provide error estimates "for free"
recursively partition parameter space
splitting: choose the split that most decreases impurity
stopping
● At some maximum tree depth. The maximum tree depth is a hyperparameter.
● At some minimum node split size - do not split a node if it is below this size. This size is a hyperparameter.
● When the impurity decrease will not surpass some fixed threshold. This threshold is a hyperparameter.
● If splitting would result in leaves below some minimum leaf size. This minimum leaf size is a hyperparameter.
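A sketch of one step of this recursive partitioning: compute Gini impurity (one common impurity measure) and pick the threshold on a single feature that most decreases it; the stopping rules above would be checked at each split:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Best threshold on one feature x by weighted impurity decrease."""
    best_t, best_gain = None, 0.0
    parent = gini(y)
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent - child > best_gain:
            best_t, best_gain = t, parent - child
        # stopping rules (max depth, node size, gain threshold,
        # minimum leaf size) would be checked here
    return best_t, best_gain

x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # (3.0, 0.5): a pure split
```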
random forest (trees vote):
bagging: use a random sample of the data, with replacement
splitting: select a random subset of the features (variables)
bias and variance
● Considering more candidate features leads each individual tree to be stronger (less bias)
● Considering fewer candidate features leads the entire forest of trees to be less correlated (leading to less variance)
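A sketch of this bias-variance knob with scikit-learn: `max_features` controls how many candidate features each split considers, and the out-of-bag score (a by-product of bagging, the "free" error estimate noted above) lets the settings be compared without a separate validation set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=40,
                           n_informative=10, random_state=0)

for max_features in ["sqrt", 0.5, None]:  # None = consider all features
    rf = RandomForestClassifier(n_estimators=200, max_features=max_features,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    # OOB accuracy comes free from the bagged samples each tree never saw
    print(max_features, "OOB accuracy:", round(rf.oob_score_, 3))
```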