Algorithms for functional genomics 2
Single-cell RNA-seq analysis
assessing cell-to-cell heterogeneity,
identifying new cell types/clusters,
differential expression
pre-processing scRNA-seq data
quality control
caution: activation/inhibition states can reduce transcription, so low counts are not always technical artefacts
Remove low-quality cells (subset by column)
Filter by library size
Filter by number of expressed genes in each cell
Examine Mitochondrial, Ribosomal or Spike-in proportions
Remove low-abundance genes (subset by row)
Filter by average expression level
Filter by expressed in at least n cells
Use a less aggressive approach for studies involving rare cells
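A minimal sketch of these QC filters with NumPy, assuming a genes × cells count matrix; the data and thresholds here are purely illustrative (real studies pick cutoffs from the QC distributions):

```python
import numpy as np

# Hypothetical genes x cells count matrix (rows = genes, columns = cells)
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(2000, 500))

library_size = counts.sum(axis=0)            # total counts per cell
genes_per_cell = (counts > 0).sum(axis=0)    # expressed genes per cell

# Remove low-quality cells: subset by column (thresholds illustrative)
keep_cells = (library_size >= 1000) & (genes_per_cell >= 200)
counts = counts[:, keep_cells]

# Remove low-abundance genes: subset by row (expressed in at least n cells)
keep_genes = (counts > 0).sum(axis=1) >= 3
counts = counts[keep_genes, :]
print(counts.shape)
```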
normalization & dimension reduction
Normalization: compute a scaling (size) factor per cell to correct for sequencing-depth imbalance
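A sketch of simple library-size normalization (size factor = library size relative to the median across cells); methods like scran use more robust pooled estimates, so treat this as the basic idea only:

```python
import numpy as np

def size_factor_normalize(counts):
    """Divide each cell (column) by its size factor: library size
    relative to the median library size across cells."""
    library_size = counts.sum(axis=0).astype(float)
    size_factors = library_size / np.median(library_size)
    return counts / size_factors

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(100, 20))          # toy genes x cells matrix
normed = np.log1p(size_factor_normalize(counts))   # log-transform for downstream steps
print(normed.shape)
```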
clustering analysis/ trajectory inference
read alignment & quantification
high-throughput sequencing -> gene expression in individual cells
barcodes/sequences
UMIs (Unique Molecular Identifiers)
modelling errors: UMIs seen only a few times are error-prone -> remove low-count UMIs (e.g. below a percentile threshold)
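A toy sketch of that percentile idea, with hypothetical read counts per UMI and an arbitrary cutoff:

```python
from collections import Counter
import numpy as np

# Hypothetical read counts per UMI at one gene/barcode
umi_reads = Counter({"ACGTACGT": 120, "TTGACCAG": 95,
                     "ACGTACGA": 2, "TTGACCAA": 1})

counts = np.array(list(umi_reads.values()))
threshold = np.percentile(counts, 50)  # hypothetical percentile cutoff

kept = {umi for umi, n in umi_reads.items() if n > threshold}
print(kept)  # {'ACGTACGT', 'TTGACCAG'} - rare, error-prone UMIs dropped
```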
FASTQ file -> BAM -> table of counts
High-resolution, high-dimensional and with high levels of noise (60-70% of entries are zero) - the count matrix is tall and thin
downstream analysis
clustering
group cells with similar expression profiles into clusters
differential expression (DE) analysis
identify genes whose expression changes between different groups of samples
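A compact sketch of both downstream steps with scikit-learn and SciPy: reduce dimensions with PCA, cluster cells with k-means, then test each gene between two clusters. Data and settings are illustrative; real pipelines use count-aware DE models (e.g. negative binomial) rather than a t-test:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from scipy import stats

rng = np.random.default_rng(1)
X = np.log1p(rng.poisson(1.0, size=(300, 2000)))   # cells x genes, log counts

pcs = PCA(n_components=10).fit_transform(X)        # dimension reduction
labels = KMeans(n_clusters=2, n_init=10).fit_predict(pcs)

# Per-gene DE between the two clusters (Welch's t-test as a placeholder)
t, p = stats.ttest_ind(X[labels == 0], X[labels == 1], equal_var=False)
print("genes with p < 0.01:", int((p < 0.01).sum()))
```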
Molecular evolution modelling
likelihood model
model sequence changes: parameters (e.g. tree length) determine the probabilities of change
Can compare probabilities to select the model that best fits the data
Markov process
memoryless steps: given the current state, earlier states no longer affect the next state
transition matrix: transition probabilities
independent steps -> each row of the transition matrix sums to 1 (for symmetric models such as Jukes-Cantor, columns do too)
P(Data₁..ₙ | Model) = ∏ᵢ P(Dataᵢ | Model): multiply the substitution probability at each base of the sequence
millions of characters x small probabilities -> underflow errors & inaccuracy
thus use logs: log P(Data₁..ₙ | Model) = ∑ᵢ log P(Dataᵢ | Model)
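A small demonstration of why the log is needed: multiplying a million per-site probabilities underflows to 0.0, while summing logs stays accurate (per-site probabilities here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
site_probs = rng.uniform(0.05, 0.25, size=1_000_000)  # P(Data_i | Model)

direct = np.prod(site_probs)          # underflows to exactly 0.0
log_lik = np.sum(np.log(site_probs))  # stable log-likelihood

print(direct)   # 0.0
print(log_lik)  # a large negative number (~ -1.9e6)
```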
application to trees
parameters
tree topology: consider one at a time
mutation rates: the number of changes
nucleotide frequencies: invariant/fixed, estimated from the data
divergence times: branch lengths
pruning trees
naive likelihood sums over all internal-state combinations -> number of terms: 4^n
codons: 64^n
independent subtrees: a subtree's likelihood does not depend on nodes above it -> can be computed bottom-up and "cut off" (pruning) -> number of terms linear in n
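A minimal sketch of Felsenstein's pruning algorithm for one site under Jukes-Cantor: each node's conditional likelihood vector is built bottom-up, so cost is linear in the number of nodes rather than 4^n (tree and branch lengths hypothetical):

```python
import numpy as np

STATES = "ACGT"

def jc_prob(t):
    """Jukes-Cantor transition matrix for branch length t
    (expected substitutions per site)."""
    same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
    return np.where(np.eye(4, dtype=bool), same, diff)

def conditional(node):
    """Conditional likelihood vector L[s] = P(data below node | state s)."""
    if isinstance(node, str):                 # leaf: observed base
        vec = np.zeros(4)
        vec[STATES.index(node)] = 1.0
        return vec
    (left, t_left), (right, t_right) = node   # internal node: two (child, branch) pairs
    lvec = jc_prob(t_left) @ conditional(left)
    rvec = jc_prob(t_right) @ conditional(right)
    return lvec * rvec                        # subtrees are independent

# ((A:0.1, C:0.2):0.05, G:0.3) written as nested (child, branch-length) pairs
tree = (((("A", 0.1), ("C", 0.2)), 0.05), ("G", 0.3))
site_likelihood = 0.25 * conditional(tree).sum()  # uniform root frequencies
print(site_likelihood)
```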
Bootstrapping: useful when hypotheses are hard to frame as nested models
• Resample the data, taking individual columns from the alignment with replacement
• Recalculate the full analysis with the new data: computationally expensive
• Generates an empirical distribution that represents the impact of sampling
• Assess the rank of your result in this distribution to generate a p-value
• Use ranks to generate confidence intervals of parameters
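A sketch of the column-resampling bootstrap described above; `analyse` is a placeholder for whatever statistic the full analysis produces, and `alignment` is assumed to be a sequences × sites array:

```python
import numpy as np

def bootstrap(alignment, analyse, n_boot=1000, seed=0):
    """Resample alignment columns with replacement, rerun the analysis,
    and return the empirical distribution of the statistic."""
    rng = np.random.default_rng(seed)
    n_sites = alignment.shape[1]
    stats = []
    for _ in range(n_boot):
        cols = rng.integers(0, n_sites, size=n_sites)  # with replacement
        stats.append(analyse(alignment[:, cols]))      # expensive: full re-analysis
    return np.array(stats)

# Ranks give a confidence interval, e.g. the central 95%:
# lo, hi = np.percentile(bootstrap(aln, analyse), [2.5, 97.5])
```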
fitting the models
large search space -> cannot search exhaustively
optimisation methods:
gradient descent/hill climbing: Newton-Raphson
simulated annealing
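A sketch of Newton-Raphson hill climbing on a single parameter: maximise the Jukes-Cantor log-likelihood of the distance d between two sequences of n sites with k differences (derivatives taken numerically; the closed-form JC distance is printed as a check):

```python
import numpy as np

def loglik(d, n=1000, k=180):
    """JC log-likelihood of distance d for n sites with k differences."""
    p_diff = 0.75 * (1.0 - np.exp(-4.0 * d / 3.0))
    return k * np.log(p_diff) + (n - k) * np.log(1.0 - p_diff)

def newton_raphson(f, x, h=1e-5, n_iter=20):
    """Maximise f by Newton-Raphson with numerical derivatives."""
    for _ in range(n_iter):
        d1 = (f(x + h) - f(x - h)) / (2 * h)          # first derivative
        d2 = (f(x + h) - 2 * f(x) + f(x - h)) / h**2  # second derivative
        x -= d1 / d2                                  # Newton step uphill
    return x

d_hat = newton_raphson(loglik, 0.1)
closed_form = -0.75 * np.log(1.0 - 4.0 * 180 / (3.0 * 1000))
print(d_hat, closed_form)  # both ~0.2058
```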
example models
Jukes-Cantor: a single rate, so the same transition probability for every change (A→C, A→G, ...)
• Additional biological realism comes at the cost of additional parameters
• Models can frequently be parameterised in different ways to implement the same underlying biological model
• Parameterisations may give identical results or may impact scaling of other parameters
• E.g. the same ratio of branch lengths in a tree but different absolute values
• Use variants with the fewest parameters for speed, but beware impacts on interpretation and comparisons
Kimura 2 Parameter: transitions (purine↔purine, pyrimidine↔pyrimidine) at rate α; transversions (purine↔pyrimidine) at rate β; only the ratio κ = α/β matters (set one rate to 1 and the other to κ, effectively one parameter)
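A sketch contrasting the two models via their rate matrices (using SciPy's matrix exponential): K2P puts transitions at rate κ and transversions at rate 1, and setting κ = 1 recovers Jukes-Cantor, which is exactly the nesting discussed below:

```python
import numpy as np
from scipy.linalg import expm

def k2p_rate_matrix(kappa):
    """K2P rate matrix over A, C, G, T: transitions (A<->G, C<->T)
    at rate kappa, transversions at rate 1 (unnormalised)."""
    Q = np.ones((4, 4))
    Q[0, 2] = Q[2, 0] = kappa            # A<->G transition
    Q[1, 3] = Q[3, 1] = kappa            # C<->T transition
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))  # rows sum to 0, so expm rows sum to 1
    return Q

P_k2p = expm(k2p_rate_matrix(2.0) * 0.1)  # transition probabilities at t = 0.1
P_jc = expm(k2p_rate_matrix(1.0) * 0.1)   # kappa = 1 collapses to Jukes-Cantor

print(P_k2p.round(3))  # transitions (A<->G, C<->T) more probable
print(P_jc.round(3))   # all off-diagonal entries equal: the JC structure
```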
General time reversible (GTR)
The previous models are special cases of this model
Empirical Models
protein sequences would require a large number of parameters
so use fixed (empirically estimated) probabilities
Use alignments of lots of proteins to estimate the rate at which changes are evolutionarily accepted
Multi-character models
use groups of an arbitrary number of sites
nested models
Models are considered nested when fixing a parameter in the more highly parameterised model gives the alternative model
In general, more highly parameterised models should only be used when they can be shown to improve the fit to the data
statistical hypothesis testing
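For nested models, the standard test is the likelihood ratio test: twice the log-likelihood improvement is compared to a chi-squared distribution with degrees of freedom equal to the number of extra parameters. A sketch with hypothetical fitted log-likelihoods:

```python
from scipy.stats import chi2

# Hypothetical maximised log-likelihoods from fitting both models
loglik_jc = -1520.4   # simpler model
loglik_k2p = -1512.9  # nested alternative with 1 extra parameter

lr_stat = 2 * (loglik_k2p - loglik_jc)  # likelihood ratio statistic
df = 1                                  # extra parameters in the larger model
p_value = chi2.sf(lr_stat, df)

print(lr_stat, p_value)  # small p: the extra parameter improves the fit
```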
Machine learning in biology and classification
linear and logistic regression
decision boundary
p = 0.5: logistic regression can only perfectly separate linearly separable classes
outliers: abnormal points that take on extreme values
logistic regression is not sensitive to outliers
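A scikit-learn sketch of the p = 0.5 decision boundary on hypothetical 2-D data; the boundary is the line where w·x + b = 0, which is why only linearly separable classes can be separated perfectly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)

# The decision boundary is where the predicted probability is 0.5,
# i.e. where w . x + b = 0 (a straight line in 2-D)
w, b = clf.coef_[0], clf.intercept_[0]
print(f"boundary: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
print("accuracy:", clf.score(X, y))
```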
machine learning
measuring performance
confusion matrix
imbalanced data: precision-recall curve
ROC curve (false positive rate vs sensitivity) best used when:
● Class labels are balanced
● TP, FP, TN and FN are known
PR curve best used when
● Class labels are imbalanced
● TN are not known
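A sketch computing both summaries with scikit-learn on hypothetical imbalanced data; the PR summary is the more informative one here because it ignores the huge pool of true negatives:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)   # imbalanced: ~5% positives
y_score = y_true * 0.5 + rng.random(1000) * 0.6  # noisy classifier scores

print("ROC AUC:", roc_auc_score(y_true, y_score))
print("average precision (PR):", average_precision_score(y_true, y_score))
# The PR summary is much lower, exposing the imbalance the ROC hides
```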
training procedures
training set/ test set
under-fitting:
high bias
under-parameterised (overly simple) model
Over-fitting (high variance)
trained too well on a specific dataset and does not generalise
● Biased training
● Small training sets
● Over-parameterised (overly complex) model
Solution:
● Get more data
○ Higher signal to noise ratio
○ Drown out fluctuations with the pattern we want to fit
● Use a less complex model
○ Model with only capacity to fit the pattern, not fluctuations
○ e.g. use fewer features - choose how many using validation
● Add explicit regularisation terms to model
○ Penalise higher-order polynomials, high weights, etc
○ Choose terms using validation
● Stop training earlier
○ If training is iterative, stop before we over-fit (use validation)
● Use noise to regularise
○ If training is iterative, add different random noise each iteration
noise drowns fluctuations
training (learn parameters)
validation
cross-validation
optimise the learning process: meta-parameters (hyperparameters)
testing (assess performance)
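A sketch of k-fold cross-validation for choosing a meta-parameter (here tree depth, on synthetic data), with a held-out test set used only once at the end to assess performance:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Validation: 5-fold CV on the training set only, to pick max_depth
scores = {d: cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                             X_tr, y_tr, cv=5).mean()
          for d in [2, 4, 8, None]}
best_depth = max(scores, key=scores.get)

# The untouched test set assesses generalisation exactly once
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_tr, y_tr)
print("chosen depth:", best_depth, "test accuracy:", final.score(X_te, y_te))
```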
Naive Bayes Classifier
SVM
margin = 1/|w|
find the maximum-margin hyperplane
maximise perpendicular distance
requires linearly separable data (hard margin)
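A sketch with a linear SVC on hypothetical separable data: a large C approximates the hard-margin SVM, and the margin 1/|w| is recoverable from the fitted weights:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (40, 2)), rng.normal(2, 0.5, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

# Large C approximates a hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 1.0 / np.linalg.norm(w)  # perpendicular distance to the nearest points
print("margin:", margin)
print("support vectors:", len(clf.support_vectors_))
```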
decision trees and random forests (forests can be hard to interpret)
surprisingly effective
implicit feature selection
parameter-free
both categorical and numeric data
robust to outliers
Provide error estimates "for free"
recursively partition parameter space
splitting: choose the split that most decreases impurity
stopping
● At some maximum tree depth. The maximum tree depth is a hyperparameter.
● At some minimum node split size - do not split a node if it is below this size. This size is a hyperparameter.
● When the impurity decrease will not surpass some fixed threshold. This threshold is a hyperparameter.
● If splitting would result in leaves below some minimum leaf size. This minimum leaf size is a hyperparameter.
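A sketch of one step of this recursive partitioning: compute Gini impurity (one common impurity measure) and pick the threshold on a single feature that most decreases it; the stopping rules above would be checked at each split:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Best threshold on one feature x by weighted impurity decrease."""
    best_t, best_gain = None, 0.0
    parent = gini(y)
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent - child > best_gain:
            best_t, best_gain = t, parent - child
        # stopping rules (max depth, node size, gain threshold,
        # minimum leaf size) would be checked here
    return best_t, best_gain

x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # (3.0, 0.5): a pure split
```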
random forest (trees vote):
bagging: use a random sample of the data, with replacement
splitting: select a random subset of the features (variables)
bias and variance
● Considering more candidate features leads each individual tree to be stronger (less bias)
● Considering fewer candidate features leads the entire forest of trees to be less correlated (leading to less variance)
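A sketch of this bias-variance knob with scikit-learn: `max_features` controls how many candidate features each split considers, and the out-of-bag score (a by-product of bagging, the "free" error estimate noted above) lets the settings be compared without a separate validation set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=40,
                           n_informative=10, random_state=0)

for max_features in ["sqrt", 0.5, None]:  # None = consider all features
    rf = RandomForestClassifier(n_estimators=200, max_features=max_features,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    # OOB accuracy comes free from the bagged samples each tree never saw
    print(max_features, "OOB accuracy:", round(rf.oob_score_, 3))
```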