Algorithms for functional genomics
Mapping and alignment
alignment measures similarity, but be careful of "repeat regions"
usage: similarity -> common ancestors / function
alignment is a first step in many kinds of analysis, and is a part of many other algorithms.
database searching
Phylogeny - Given a set of related sequences, infer the evolutionary relationship between them, or evolutionary "time" given a mutation rate
De novo sequence assembly - reconstruct a single large sequence from many small pieces of that sequence (DNA, RNA)
Read Mapping - Given a reference sequence, align many small reads to it
○ infer variants
○ quantify mRNA (RNA-seq)
Protein function - find protein sequence regions which are similar to proteins with known function, to infer similar structure and function
Edit distance:
Hamming distance: substitutions only
√ Levenshtein distance: substitutions or indels
naive recursive computation: O(3^N)
Global Alignment:
full length of each sequence
Needleman-Wunsch Algorithm
time/space complexity: O(n×m)
A Dynamic Programming Algorithm (DPA)
banded alignment: O(n×d)
Keep track of which cell(s) was/were the origin of the best score(s) for each given cell, and use this information for traceback. This increases storage requirements by a constant factor (i.e. they are still O(N^2)).
Or, during traceback, calculate which cells(s) could have been the origin of the best score(s) for each cell. This increases the computational cost of traceback by a constant factor (i.e. it is still O(N)).
when: whole sequences are expected to be related
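A minimal sketch of the Needleman-Wunsch dynamic program (score only; traceback omitted). The unit match/mismatch/indel scores are illustrative assumptions, not values from the notes:

```python
# Global alignment score by dynamic programming: O(n×m) time and space.
def needleman_wunsch(a, b, match=1, mismatch=-1, indel=-1):
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] against b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * indel          # a[:i] aligned against all gaps
    for j in range(1, m + 1):
        F[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + indel, F[i][j - 1] + indel)
    return F[n][m]  # tracing back from this cell recovers the alignment

print(needleman_wunsch("GATTACA", "GCATGCU"))  # 0 with these scores
```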
local alignment:
Smith-Waterman-Gotoh Algorithm
when: only parts of the sequences are expected to be related
used to find similarity between a short query sequence and a large database of sequences
traceback from the highest-scoring cell only, so only the best-scoring segment is aligned
O(n×m)
A Dynamic Programming Algorithm (DPA)
Faster tool:
BLAST: many longs vs. few shorts
fast but less sensitive than full local alignment
Fast lookup to find high-scoring matches as "seeds"
Extend the match in both directions to find the best-scoring alignment around that "seed"
Scoring variations: decides the "best" alignment
generalise the score for each pair
Affine gap costs
rather than a simple per-base indel cost
it is more likely to insert (or delete) multiple bases at once than a single base multiple times
an indel of length k has cost
w(k) = h + g×k
where h is the cost to open a gap
and g is the cost to extend a gap
used in BLAST, both local and global can use
O(n×m)
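A tiny illustration of the affine gap formula; h = 4 and g = 1 are made-up values:

```python
# Affine gap cost w(k) = h + g*k: one gap-open penalty plus k extensions.
def gap_cost(k, h=4, g=1):
    return h + g * k

# One 3-base indel (cost 7) is cheaper than three separate 1-base indels
# (cost 15), matching the intuition that a single longer indel event is
# more likely than several independent ones.
print(gap_cost(3), 3 * gap_cost(1))  # 7 15
```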
Semi-Global alignment & read mapping
NGS[next generation sequencing] read alignment
find where all the (short) reads/query fit on our reference genome, quickly and accurately.
Inexact alignment:
○ errors in the read
○ true differences between sample & reference
scoring alignments: e.g. require (number of matches / length of read) > 80%
Semi-Global alignment(glocal)
Match whole of the [short]query against part of [long]reference
full dynamic-programming alignment against the entire reference genome is too costly
Seed and Vote Alignment Strategy
rely on the relatively low error rates in read data
steps
If most of the read sequence is correct and we do not expect too much genetic variation between reads and reference,
Then we can reliably find subsequences within reads that match the reference perfectly.
Use an index to the reference to quickly query “seeds” from the read.
Align reads to seeded regions of the reference only.
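A toy sketch of the seed-and-vote idea (hypothetical helper names, not a real aligner): each exact k-mer hit votes for the read start position it implies, and full alignment is then done only around the winning position:

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def vote_for_position(read, index, k):
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            # A seed at read offset `offset` implies the read starts at
            # reference position ref_pos - offset.
            votes[ref_pos - offset] += 1
    return max(votes, key=votes.get) if votes else None

ref = "ACGTACGTGGTCACGTTTACGGAACGTACGT"
idx = build_index(ref, k=5)
print(vote_for_position("GTCACGTTTAC", idx, k=5))  # 9
```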
read mapping
Two common sources of error
○ Sequencing error. < 1% on current hardware
○ Incorrect reference.
■ Poor reference assembly
■ Reference is for a related but different organism
■ Same organism but individuals are different!
■ Cancer vs normal tissue
many shorts vs. few longs
Tools: BWA...
ambiguous alignment: one short read can fit on several places of the reference
Multiple-mapping reads
● Align to all possible places
○ useful in some situations
○ but the read belongs to only one "real" place
● Align to the first place you find
○ not a good idea... but some tools still do it
● Align to a random choice of all valid places
○ useful in some situations
● Don't use multiple-mapping reads
○ often necessary if calling SNPs
data format
FASTQ
SAM/BAM
End-gap free alignment (another name for semi-global alignment)
De novo Genome assembly
Reconstruct the original genome from the sequence reads only
why
●Produce a reference for new species
● A wide variety of molecular biological tools become available or more effective
● Coding sequences can be studied in the context of their related non-coding (eg regulatory) sequences
● High level genome structure (number, arrangement of genes and repeats) can be studied
● Genomic variation can be studied by comparing against the reference
Genome -> short fragments/short sequences/reads -> sequencing[stored in FASTQ]
FASTQ:
Encoded quality values, one symbol per nucleotide
Phred Qualities: formalization
quality scores/value for each nucleotide:
the measurement of reliability for each base
derived from the physical process
quality encoding
letters/symbols to represent numbers
Q0 -> Q40: bad -> excellent
Q10 -> 1 error in 10 -> 90% accuracy (bad)
Q20 -> 1 error in 100 -> 99% accuracy (ok)
Q30 / Q40 -> 99.9% / 99.99% accuracy (very good / excellent)
P = 10 ^ (-Q/10)
Q= Phred quality score
P = probability of base call being incorrect
accuracy = 1-P
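A small sketch decoding Phred+33-encoded quality characters (the usual encoding in modern FASTQ files), using P = 10^(-Q/10):

```python
def phred_scores(quality_string, offset=33):
    # Each symbol encodes Q as (ASCII code - offset).
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    return 10 ** (-q / 10)  # P = 10^(-Q/10)

quals = phred_scores("II5?+")
print(quals)                                  # [40, 40, 20, 30, 10]
print([error_probability(q) for q in quals])  # 0.0001 ... 0.1
```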
steps
• Build an assembly graph – a picture of read connections
assembly graphs
Directed graph
● Nodes (vertices): Reads
● Edges: Overlaps
● Paths: Connected reads → Contigs!
Hamiltonian path: visit every node exactly once (cf. the travelling salesman problem)
NP-hard
• Find all overlaps between reads – hmm, sounds like a lot of work…
data amount
coverage: aka breadth
depth
○ sequencing errors
○ polymorphisms
Random reads
Coverage = 1 - e^(-Depth) (for uniformly random reads)
~100% coverage may still mean some regions are supported by only a single read
in practice, real data falls short of this idealised estimate (e.g. ~50%)
Greedy Extension
Compute all overlaps: O(N^2) × O(L^2)
Find the longest overlap and join the two reads
Repeat step 2 until there are no more overlaps.
Issues:
● Makes locally optimal decisions which may not be globally optimal (the chosen overlap may really belong to a different repeat copy, so the join may not be at the correct position)
● Computationally expensive
Sequencing by Hybridisation (SBH)
Sequencing method:
○ Put all possible k-mers on a chip (k=20-25)
○ Chop up target DNA and add fluorophores
○ Hybridise DNA to chip
○ Measure fluorescence on each spot on chip
○ Get list of k-mers present in target DNA.
problems
● Paths can be ambiguous due to low k values
● k-mers for k=5 are not that unique; uniqueness comes at higher values of k.
● Number of probes (spots on a chip) required is 4^k
● Cross-hybridisation is a problem (e.g. G/C-rich probes tend to stick to everything else, producing false fluorescence signals)
● Not all k-mers hybridise equally
K-mer de Bruijn Graph Assembly
● A directed graph
● Nodes in the graph are (k-1)-mers (the overlaps)
● Edges represent k-mers (which overlap by k-1 symbols)
balanced/unbalanced nodes: compare incoming and outgoing edges
Graph traversal[ “Eulerian Walk”]:
Start at an unbalanced node with Degree ≠ 0
Walk through the graph, adding edges
Stop when you hit another unbalanced node
A directed, connected graph is Eulerian if and only if it has at most 2 semi-balanced nodes and all other nodes are balanced.
A graph is connected if each node can be reached from some other node.
An Eulerian walk visits each edge exactly once.
Not all graphs have Eulerian walks.
Why
why K-mers
● Not all reads will overlap another read
○ Read errors
○ Coverage “holes”
● Not all reads are the same length (depending on technology)
no alignment needed -> no expensive overlap computation!
min/max overlap = 4
● To be able to use de Bruijn graphs, we need reads of length L to overlap by L-1 bases.
Why De Bruijn
fast
● Make k-mers in O(Nr) time
● Eulerian pathing is fast!
○ Can make contigs in O(E) time. (E is # of edges)
● Using de Bruijn graphs is fast!
○ No alignments
○ Can “hash” k-mers to build graph
○ Lookups are fast
○ Can build graph in O(N) time N-total length of reads【for high average coverage and perfect data G<< N, O(min(N,G))】
Hash Table
steps
Select a value for k
"Hash" the reads (make the kmers)
Count the kmers
Make the de Bruijn graph
Perform graph simplification steps - use cov cutoff
Read off contigs from simplified graph
Make the de Bruijn graph
choose k ≤ shortest read length; generate the unique k-mer list, then:
Take the first k-mer
Split it into a k-1 prefix and k-1 suffix
Add both to the graph as nodes (if not already present)
Add a directed edge to the graph connecting them
Label this edge as the k-mer
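A minimal sketch of these construction steps, with (k-1)-mer nodes and k-mer edges (a multigraph, so a repeated k-mer adds a repeated edge):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(list)  # (k-1)-mer node -> successor nodes
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            graph[prefix].append(suffix)  # edge labelled by this k-mer
    return graph

for node, successors in de_bruijn(["ACGTC", "CGTCA"], k=3).items():
    print(node, "->", successors)
# AC -> CG -> GT -> TC -> CA: an Eulerian walk spells out ACGTCA
```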
problems
repeats problem
Larger k value has caused disconnections in the graph!
but fewer loops/contigs
● Lower k
○ More connections
○ Less chance of resolving small repeats
○ Higher k-mer coverage
● Higher k
○ Less connections
○ More chance of resolving small repeats
○ Lower k-mer coverage
● Optimum value for k will balance these effects
if each node may only be visited once, the walk breaks into several contigs
read errors
errors create forks in the graph
solution: more coverage
• Errors won’t be duplicated in every read
• Most reads will be error free
• We can count frequency of k-mers
• Use frequency to clean the de Bruijn graph
Coverage Cutoff - parameter (e.g. k-mer coverage 99 vs 1 suggests a sequencing error; 50 vs 50 suggests a diploid variant)
C_k = C × (L - k + 1) / L
C_k = k-mer coverage depth
C = coverage depth
L = read length
(e.g. L = 100, k = 31, C = 50 gives C_k = 35)
Genomes Assembly
double stranded DNA
bottom strand and top strand are reverse complement
When we add them to the graph, we add both the forwards and reverse complement of the prefix/suffix.
Take the first k-mer
Split into prefix/suffix
Add them as nodes
Add the revcom to nodes
Add kmers (forwards and reverse complement) as edges
• Simplify the graph – sequencing errors will mess it up a lot
• Squash small bubbles – collapse small errors (or minor heterozygosity)
• Remove spurs – short “dead end” hairs on the graph (if nothing else matches a spur it is probably an error, and can be trimmed off)
• Join unambiguously connected nodes
– reliable stretches of unique DNA
remove contained strings.
A read fully contained within another longer read (“Not, look” in the lecture example) adds no information and can be removed.
• Traverse the graph – trace a sensible path to produce a consensus
• For each unconnected graph – at least one per replicon in original sample
• Find a path which visits each node once
– Hamiltonian path/cycle is NP-hard (this is bad)
– solution will be a set of paths which terminate at decision points
• Form consensus sequences from paths
– use all the overlap alignments
– each of these collapsed paths is a contig
contigs: Contiguous, unambiguous stretches of assembled DNA sequence
Ends:
Real ends (for linear DNA molecules)
Dead ends (missing sequence)
Decision points (forks in the road)
sequences (may contain errors) -> a few long sequences / contigs (based on overlaps)
Ideally, one sequence per replicon
if reads disagree -> take the majority consensus
if there is no clear majority, break the contig at that point
difficulties
long reads would not be a problem
Many(million to billion) pieces (read length is very short compared to the genome)
storing in the RAM is a challenge
Lots of overlaps - examining every pair of reads is O(N^2)
can be reduced towards O(N)
Lots of sky (short repeats)
TATATATATA
reads: TATA TATA TATA
Dirty pieces (sequencing errors)
Multiple copies (long repeats)
repeats
type: tandem / interspersed
solution: Long reads can span repeats
It is impossible to resolve repeats of length L unless you have reads longer than L.
Ploidy - haploid (e.g. bacteria; easiest to assemble), diploid and polyploid
slightly different cause bubbles in the assembly graph
Assessing assemblies
contiguity
desire: ○ Fewer contigs ○ Longer contigs
○ Number of contigs
○ Average contig length
○ Median contig length
○ Maximum contig length
○ “N50”, “NG50”, “D50”
N50: the contig length at which, summing contigs from longest to shortest, the running total first reaches ≥ 50% of the genome size
caution: assembly errors that wrongly merge contigs will inflate N50
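A short sketch of that definition; passing an estimated genome size instead of the assembly total would give NG50:

```python
def n50(contig_lengths, genome_size=None):
    total = genome_size or sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:   # first contig to reach half the total
            return length

print(n50([100, 80, 50, 30, 20, 10]))  # total 290 -> 100 + 80 >= 145 -> 80
```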
completeness: relative to the estimated genome size
[0,1]
Assembled Genome Size/ estimated Genome Size
estimated Genome size:
based on known, previously assembled genomes
Core Genes
thought to be present in a wide variety of organisms
Number of Core Genes in Assembly / Number of Core Genes in Database
correctness
Proportion of the assembly that is free from mistakes
Mis-joins
Repeat compressions - collapsed repeats make the assembly too short
Unnecessary duplications - make the assembly too long
Indels / SNPs caused by assembler
check for self consistency
● Align all the reads back to the contigs
● Look for inconsistencies
Summary:
○ sequencing bias under represents certain regions
○ Reads are short relative to genome size
○ Repeats create tangled hubs in the assembly graph
○ Sequencing errors cause detours and bubbles in the assembly graph
Phylogenetic trees
clustering methods
trees
root -> branch -> node -> tip
unrooted tree
Parsimony methods:
optimal tree[NP-hard] <- minimize the number of changes on the branches
Sankoff algorithm(DPA)
score a tree/ calculate parsimony
minimum cost at the root
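A minimal Sankoff sketch for one alignment column on a small binary tree; the unit substitution cost is an illustrative choice:

```python
import math

STATES = "ACGT"

def cost(a, b):
    return 0 if a == b else 1  # unit cost for any substitution

def sankoff(node):
    # node is ("leaf", state) or ("node", left_subtree, right_subtree);
    # returns {state: minimum cost of the subtree if this node has state}.
    if node[0] == "leaf":
        return {s: (0 if s == node[1] else math.inf) for s in STATES}
    left, right = sankoff(node[1]), sankoff(node[2])
    return {s: min(cost(s, t) + left[t] for t in STATES)
             + min(cost(s, t) + right[t] for t in STATES)
            for s in STATES}

tree = ("node", ("node", ("leaf", "A"), ("leaf", "C")), ("leaf", "A"))
print(min(sankoff(tree).values()))  # parsimony score of this tree: 1
```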
Combinatorics and heuristics:
find the best tree
rearrange trees by breaking and reattaching branches
efficient <- recursive Sankoff's algorithm
sequential addition
choose a few initial tips
add the fourth tip at each possible position and calculate the score, keeping the best (use shortcuts like not recomputing unchanged nodes)
input: sequences / character data
output: an optimal but not necessarily the right tree
changes include the changed character and the position
branch length = the number of changes along that branch
Outgroup: a distantly related taxon used to root the tree (e.g. the lineage most diverged from the Old World monkeys and apes)
Distance methods
UPGMA [unweighted pair group method with arithmetic mean] algorithm
"unweighted" = every tip contributes equally to the cluster averages
pro: rooted tree
cons: bad assumption
Assumption: all branches have the same evolutionary rate - a molecular clock
often a really bad assumption in real life, but it works well for closely related species or genes
Ultrametric trees: all root-to-tip path lengths are equal
Resulting tree is not guaranteed to be optimal but it is efficiently computable
O(n^3) unoptimized -> O(n^2) optimized
Neighbor Joining algorithm
pro: no molecular-clock assumption
con: produces an unrooted tree
watch out for negative branch lengths
Guaranteed to recover the correct tree if the distances exactly reflect a tree
O(n^3) unoptimized -> O(n^2) optimized
K-mers
Alignment-free sequence comparison
when:
Species identification/detecting contamination in sequencing experiments
de novo assembly
Repeat and motif detection
Why: Usually much faster than alignment but much less sensitive
How: count the fraction of unique k-mers (k-mers not shared with the other sequences)
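One concrete version of this idea (the exact measure varies by tool): the Jaccard distance between the k-mer sets of two sequences:

```python
def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_distance(a, b, k=4):
    ka, kb = kmers(a, k), kmers(b, k)
    # 0 = identical k-mer content, 1 = no shared k-mers
    return 1 - len(ka & kb) / len(ka | kb)

print(kmer_distance("ACGTACGTAC", "ACGTACGTTT"))  # 1 - 4/6 ≈ 0.33
```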
goal: find a good[not optimal] tree by combining closely related tips
input: a distance between each gene/species/sequence at the tips/ distance matrix:
alignment scores
k-mer distances
immunological distances (larger = more distantly related; smaller = more closely related)
output
may not be the right tree either, but it is good enough and faster. The right tree tells you what happened over the course of evolution.
Newick file format
Clustering
not require labels
Good for discovering patterns;
human-aided interpretation
Hierarchical clustering -
agglomerative algorithm: heatmap
gives all possible numbers of clusters[for given distance metric and linkage method]. -> slice the tree at what level you're interested
deterministic -> the outcome is fixed by how we define the distance between clusters (for a given distance matrix)
time complexity: O(n^2) n- total objects number
Can never undo a previous merge - i.e. very sensitive to input order; can't add a new point later
distance matrix/ dissimilarity matrix
Given a set of N items to be clustered, and an N×N distance (or similarity) matrix:
1. Start with N clusters (each containing one item).
2. Merge the two clusters that are closest.
3. Recompute the distances between clusters (only the distances from the new cluster to the others are needed).
4. Repeat steps 2 and 3 until all items are in one cluster, then choose a cutoff threshold (see the sketch below).
Single linkage: min d(a,b)
Complete linkage: max d(a,b)
Average linkage: UPGMA / WPGMA
Ward linkage: the merge that optimises the objective function
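A minimal sketch with SciPy: agglomerative clustering with average linkage, then cutting the dendrogram at a chosen height (the data and cutoff are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
Z = linkage(points, method="average")  # also: "single", "complete", "ward"
labels = fcluster(Z, t=3.0, criterion="distance")  # cut the tree at height 3
print(labels)  # e.g. [1 1 2 2 3]: three clusters at this cutoff
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree
```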
Dendrograms: tree diagrams showing the hierarchical relationship between clustered items
e.g. phylogenetic relationship between samples
heatmaps
normalise rows
Loses absolute expression information. But highlights relative expression of each gene across samples.
when:
● simultaneously seeing clustering of genes, and of samples
● visualising how these clusters correspond to one another
● seeing which genes are upregulated vs downregulated in which groups
Distances and dissimilarities
matrix/function:
Euclidean/ Manhattan
edit distance: Hamming/ Levenshtein/ Longest common substring (LCS) distance: single-letter insertions or deletions
k-mer distance
K-means clustering
k: meta-parameter
sensitive to outliers
clusters must be convex - cannot handle arbitrary shapes
minimise the objective/cost function - represents the within-cluster variance
objective function optimisation
walking downhill <- guaranteed to converge
If the objective function is convex, then an algorithm that decreases it at each step must eventually reach the global minimum.
many minima -> local minima, especially in high dimensions
trying all partitions is infeasible: there are O(K^N) of them
Lloyd's algorithm - good, simple and fast
O(N×K×d×i)
d = dimensionality
i = number of iterations
N items, K clusters
expectation-maximisation (EM) style algorithm: alternately vary the cluster centroids and the point assignments, holding the other fixed (see the sketch below)
1. Place K points into the space of the objects being clustered; these represent the initial centroids.
2. Assign each object to the group with the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.
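A bare-bones sketch of Lloyd's algorithm following steps 1-4 (no k-means++ initialisation, restarts, or empty-cluster handling):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    for _ in range(iters):
        # step 2: assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 3: move each centroid to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break  # step 4: stop once the centroids no longer move
        centroids = new
    return labels, centroids

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
print(kmeans(X, k=2)[0])  # e.g. [0 0 1 1]
```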
Not guaranteed to converge to the optimal solution, but efficient.
Guaranteed to converge - the objective function decreases at every step.
solutions may differ based on different centroid initializations
problem
poor initialisation of the centroids-> slow convergence/ convergence to local minimum
K-means++: choose initial centroids points which are far from centroids chosen so far
whether global or local minimum
run k-means multiple times [with different initialisations] to check for stable convergence
choose K
larger K always lowers the objective function, but arbitrarily large K is not meaningful
choose K at the "elbow", where the objective function stops decreasing sharply
○ may use validation to select meta-parameters like K
fails:
Clusters of different sizes -> k-means tends to make all clusters the same size
Clusters of different density
Non-convex/ non-globular/ non-spherical clusters/shapes
solution: simply increase K
K-medoids
instead of the mean of the points, take the most central data point
but same objective function as K-means
● Assign points to the nearest centroid as for k-means
● Update centroids by choosing the most central point. Commonly, choose a point in the cluster randomly and make it the centroid if it reduces the objective function.
● Less sensitive to outliers than k-means
(just as medians are less sensitive to outliers than means)
● Only need to know pairwise distances - don't need to compute distance to some arbitrary centroid.
● Slower to calculate - roughly O(kN^2) instead of O(kN)
K-medians: take the median in each dimension (the result may or may not be a data point); less sensitive to outliers than k-means
kernel K-means: can detect clusters that are non-convex in the original space, because the objective-function distances are computed in a transformed (kernel) space
parameters/ hyperparameters or meta-parameters
minimize intra-cluster distances
maximize inter-cluster distances
application: document search, image compression, gene expression,heatmaps
properties: exclusive/ overlapping/ probabilistic/ partitional/ hierarchical
Clusters validation
external validation: Compare the results of a cluster analysis to externally known class labels (ground truth)
Internal Validation: Evaluate how well the results of a cluster analysis fit the data without reference to external information
types: Density-based [uses point distances; finds arbitrarily-shaped clusters] / probabilistic / Network-based [a graph instead of distances; clusters are densely connected subgraphs]
Network analysis & graph theory
when
higher-order structure of data
phenomena appears due to interactions -> graph and networks needed
analyse experimental data
networks:
labelled graphs
G=(V,E)
V- vertices
E- edges
order - number of vertices (nodes/entities/points)
Edges: labelled / directed (arcs) / weighted / none of the above
size - number of edges
walk/path: starts and ends at a vertex; may contain cycles or be acyclic
graph representation
• Objects – explicitly as a set of vertices and edges
• Adjacency matrix
• Adjacency lists: reduce storage spaces
Sparse graphs are graphs with few edges; these fill adjacency matrices with huge numbers of zeros (see the sketch below)
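A small sketch contrasting the two representations on a toy directed graph:

```python
vertices = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("C", "D")]

index = {v: i for i, v in enumerate(vertices)}
matrix = [[0] * len(vertices) for _ in vertices]  # mostly zeros if sparse
adj = {v: [] for v in vertices}                   # compact for sparse graphs
for u, v in edges:
    matrix[index[u]][index[v]] = 1
    adj[u].append(v)

print(adj)                                 # {'A': ['B', 'C'], 'B': [], ...}
print({v: len(adj[v]) for v in vertices})  # outdegree of each vertex
```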
Graph topology
concepts
degree: the number of edges incident on the vertex
outdegree: the number of outbound edges or arcs
indegree: the number of inbound edges or arcs
connectivity: the minimum number of edge or vertex deletions required to disconnect that element from the remaining nodes
edge/node connectivity
subgraph/clique/k-clique[order k]
neighbourhood
local clustering coefficient
network average clustering coefficient
cancer genes: hub genes are important failure points in cancers
degree distribution
scale free graph: degree distribution follows a power law
Growth:
Preferential attachment
Scale-free organisation provides mechanisms for biological
robustness and error tolerance
many biological networks are scale-free
degree distribution with probability
random graph
small-world graph: vertices connect via short paths; built by rewiring edges with some probability
heterogeneity/ randomness/ modularity
how to map biological problems onto graphs
Weighted Gene Coexpression Network Analysis- WGCNA
motivation
Detect sets of genes with shared expression patterns; this allows us to find groups of functionally related genes.
Steps
Define a gene co-expression similarity
unsigned Pearson correlation -> similarity matrix
we are only interested in co-expression/co-regulation when forming modules, so the type of co-expression (activation vs inhibition) is irrelevant
define a Family of adjacency functions
decide whether each pair of genes is co-expressed based on adjacency function
hard thresholding: binary adjacency
soft thresholding: raise the similarity to a power, with diagonal = 0 (see the sketch below)
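A rough numpy sketch of an unsigned soft-threshold adjacency (beta = 6 is a placeholder; WGCNA itself typically goes on to use topological overlap rather than plain 1 - similarity):

```python
import numpy as np

def soft_threshold_adjacency(expr, beta=6):
    # expr: genes × samples expression matrix
    similarity = np.abs(np.corrcoef(expr))  # unsigned Pearson correlation
    adjacency = similarity ** beta          # soft thresholding
    np.fill_diagonal(adjacency, 0)          # no self-connections
    return adjacency

expr = np.random.default_rng(0).normal(size=(5, 20))  # 5 genes, 20 samples
dissimilarity = 1 - soft_threshold_adjacency(expr)    # input to clustering
print(dissimilarity.shape)  # (5, 5)
```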
determine the AF[adjacency function] parameters
want a scale free network conditions:
• that yields a scale-free like topology of the resulting network
• that prevents the resulting network from having few connections
use the smallest parameter at which the scale-free fit saturates above 0.8
define a measure of node dissimilarity
build dissimilarity matrix based on the similarity function
dissimilarity = 1 - similarity score
identify network modules/clustering
module: groups of nodes where expression profiles are highly correlated; high topological overlap
cluster the entries in the dissimilarity matrix
commonly hierarchical clustering, then cut at the chosen height
6. relate network concepts to each other
Each module has a characteristic eigengene – an expression profile that is ‘typical’ of the module
• Perform PCA on the module’s expression, eigengene = first PC
• Merge modules if their eigengenes are similar
relate concepts to external gene or sample info
• Compute network statistics within the modules,
then test for association with experimental outcomes
DNA is made from four possible nucleotides/bases (ACGT): complementary / double-stranded -> stable information storage, easy to replicate
chromosomes
protein: chain of amino acids
codon: mRNA needs at least three bases to encode one amino acid (4^3 = 64 combinations)
Fluorescence
mitochondrial
concatenate: multi-FASTQ
replicon
DNA-> transcription -> RNA -> translation -> Protein
DNA-sequencing
● Identification: ○ De novo sequencing ○ Exome capture sequencing
● Quantification: ○ RNA-seq ○ ChIP-seq ○ Barcode/tag counting experiments
cut DNA into small pieces -> make many copies -> read the sequences -> interpret the short sequenced fragments ("reads")
FASTQ files
start symbol -> sequence ID -> sequence -> separator line -> encoded quality values, one symbol per nucleotide
● Assembly: try to reconstruct the original sequence
● Alignment: try to match reads to a known reference genome
High-throughput alignment, indexing
Goal: fast and feasible alignment of many strings to a reference genome(its subsequences can be found fast through indexing which is related to compression)
compression -> smaller memory footprint
High-throughput alignment
full DP alignment is O(m×L×N): m = read length, L = reference length, N = number of reads
thus use high-throughput aligners
use a fast method to identify likely alignment sites:
small near-exact-match substrings (seeds)
-> diagonal stretches in the alignment grid
Pro: near-exact matching allows much faster algorithms
Con: may miss some good alignments if there is no near-exact seed match
seed-and-extend
find seed sites
index: positions start at 0
fast lookup of seed strings in the reference genome
the reference genome index can be pre-computed and re-used
hash tables
quickly find exact matches
store k-mers: O(k) lookup - does not grow with genome size
hash function loses information -> can't easily recover the original
collisions: different keys can hash to the same slot
suffix indexing
suffix tree
naive construction takes O(m^2) time (linear-time algorithms exist)
suffix tries
contain all suffixes of T: m(m+1)/2 characters in total
e.g. T = a^m: m "a" nodes + (m+1) "$" leaves (plus root) = 2m+2 nodes
the number of suffix-trie nodes grows between O(m) and O(m^2)
suffix arrays: smaller but less informative
T is part of index; store the sorted indices
space: O(m)
lexicographic (alphabetical) order of T’s suffixes
substring must be a prefix of ≥1 of T’s suffixes
Suffixes sharing a prefix are consecutive in the suffix array -> binary search
Constant factor greatly reduced compared to suffix tree
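A naive sketch: sort the suffix start positions, then binary-search for a pattern; all matches form one consecutive block of the array:

```python
def suffix_array(t):
    t += "$"
    return sorted(range(len(t)), key=lambda i: t[i:])  # naive O(m^2 log m)

def find(t, sa, p):
    t += "$"
    lo, hi = 0, len(sa)
    while lo < hi:  # leftmost suffix whose prefix is >= p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and t[sa[lo]:sa[lo] + len(p)] == p:
        hits.append(sa[lo])  # matching suffixes are consecutive
        lo += 1
    return hits

sa = suffix_array("abaaba")
print(sa)                         # [6, 5, 2, 3, 0, 4, 1]
print(find("abaaba", sa, "aba"))  # [3, 0]
```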
FM index and the Burrows-Wheeler transform(BWT)
compressible: built from sorted rotations of T, which groups similar characters into runs
reversible: T can be recovered from the last column (the first column is implied by sorting)
build index: O(M) time/space is acceptable
T - text/genome; length M
P - pattern (read or part of a read) to find in T; length N
∑ - alphabet {A,C,G,T}
querying the index: O(1) ~ O(M) query
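A toy BWT built from sorted rotations (a real FM index derives the BWT from the suffix array rather than materialising all M rotations):

```python
def bwt(t):
    t += "$"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)  # last column

print(bwt("abaaba"))  # 'abba$aa': equal symbols cluster into runs,
                      # which is what makes the transform compressible
```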
seeds are small subsequences with few or zero mismatches to the reference.
Seed feasibility
find matching seeds for (most of) our true read alignment positions
for read length N with up to E errors:
splitting the read into E+1 partitions guarantees at least one error-free partition (pigeonhole)
so the seed size = N/(E+1), rounded down
problem: must check all the partitions of the read to find the error-free one
number of error-free k-mers: t ≥ N - k + 1 - kE
so k ≤ (N + 1 - t)/(E + 1)
with t = 1, seed size = N/(E+1) as above
● only perform the slower full alignment algorithm at these sites
● pick the best-scoring alignment(s)
dimensionality reduction
create new features instead of using existing ones
unsupervised
Why: Data visualization,Statistical analysis
● PCA (principal component analysis)
redundant dimensions: most of the variance lies along certain dimensions
Goal: Find a K-dimensional(new space) set of basis vectors such that, if we project {xi} into this subspace, as much as possible of the variance is retained[most accurately represent given data].
minimise the orthogonal projection error
equivalently, looks for the maximum-variance projection (may show clustering, but cannot prove that clusters exist)
principal components:
ordered by size (variance explained);
uncorrelated with each other
how
compute the covariance between each pair of dimensions -> take the eigenvectors of the covariance matrix [the basis vectors] -> choose the top K PCs -> project the data points (see the sketch below)
reversible: the basis vectors can (approximately) reconstruct the original data
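A compact sketch of that recipe using an eigendecomposition of the covariance matrix:

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                 # centre each dimension
    cov = np.cov(Xc, rowvar=False)          # covariance between dimensions
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]       # PCs ordered by variance
    basis = eigvecs[:, order[:k]]           # top-k basis vectors
    return Xc @ basis, basis                # projection + components

X = np.random.default_rng(0).normal(size=(100, 5))
X[:, 1] = 2 * X[:, 0]                       # make one dimension redundant
projected, basis = pca(X, k=2)
print(projected.shape)  # (100, 2)
```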
application
visualization: 2-3 PC
compression: depends
batch effects removal- correct data
MDS (multi-dimensional scaling)
a descriptive technique, for visualising data
principal coordinates analysis
when
visualisation, discover data structure, outliers,interpolation
goal
embed data points in a lower-dimensional space.
Preserve original pairwise distances or dissimilarities(matrix).
how
objective function that measures the distortion of original to modified distances.-> minimize the objective function
stress/strain metric: measures how well the lower-dimensional distances represent the original ones (the distortion error)
type: classical/metric[distance-based]/non-metric/generalised(from metric MDS) MDS
simplest: Gradient descent on the objective function(stress)
interpretation
● Exact position of points is not useful/meaningful
● Clusters often represent real clusters
● Outliers often represent real outliers
● Ordering along an axis may represent a real ordering
● The linear combination of variables which make up a principal component may be interesting
● t-SNE (t-distributed stochastic neighbor embedding)
Distances in the plot are not real distances (unlike PCA)
The dimensions have no interpretable meaning (unlike PCA)
Hyperparameters need to be tuned carefully and they can have a huge impact on the result
t-SNE and UMAP are very powerful for visualization and data exploration but more than one look at the data is needed
cluster sizes in the plot mean nothing
based on probability distributions
goal: keep points that are close to each other in the original space close in the low-D space -> preserve local structure / small pairwise distances
does not tend to preserve global structure
distances between clusters have no meaning, but which points group together does
can exaggerate clusters present in the original dataset - be careful of over-interpretation
● UMAP (uniform manifold approximation and projection)
based on manifolds and topology rather than probability distributions.
UMAP has so far been more extensible because of the mathematical theory underpinning it
UMAP can use labels to help with dimension reduction
UMAP has a very fast Python implementation using Numba
UMAP is an order of magnitude or more faster than t-SNE
algorithms: advantages and limitations
why, when and how to apply them