Lecture 5: Differences between microbial communities
dissimilarities between microbial communities → beta diversity
dissimilarity matrix
Jaccard index: J = a / (a + b + c)
where a = # of species shared
b = # of species unique to sample 1
c = # of species unique to sample 2
Jaccard distance / dissimilarity: D = 1 - J
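The formula above can be sketched as follows, assuming each sample is represented as a set of species names. The function names and the two example samples are made up for illustration.

```python
def jaccard_index(sample1: set, sample2: set) -> float:
    """J = a / (a + b + c) for two sets of species."""
    a = len(sample1 & sample2)   # species shared
    b = len(sample1 - sample2)   # species unique to sample 1
    c = len(sample2 - sample1)   # species unique to sample 2
    if a + b + c == 0:
        raise ValueError("both samples are empty")
    return a / (a + b + c)


def jaccard_distance(sample1: set, sample2: set) -> float:
    """D = 1 - J."""
    return 1.0 - jaccard_index(sample1, sample2)


# Hypothetical samples, for illustration only
sample_1 = {"Bacteroides", "Prevotella", "Faecalibacterium"}
sample_2 = {"Bacteroides", "Streptomyces"}
print(jaccard_index(sample_1, sample_2))     # a=1, b=2, c=1 -> 0.25
print(jaccard_distance(sample_1, sample_2))  # 0.75
```

Note that the Jaccard index only uses presence/absence, so abundances play no role here.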
Other dissimilarity measures: Bray-Curtis, Euclidean, chi-square, etc.
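As an example of an abundance-based measure, Bray-Curtis dissimilarity is the sum of absolute abundance differences divided by the total abundance of both samples. The sketch below assumes each sample is a list of counts over the same ordered taxa. The helper names and the counts are hypothetical.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    total = sum(x) + sum(y)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / total


def dissimilarity_matrix(samples, measure):
    """Pairwise dissimilarities between all samples, as a list of lists."""
    n = len(samples)
    return [[measure(samples[i], samples[j]) for j in range(n)] for i in range(n)]


# Hypothetical counts for three taxa (columns) in three samples (rows)
counts = [
    [10, 0, 5],
    [8, 2, 5],
    [0, 20, 0],
]
matrix = dissimilarity_matrix(counts, bray_curtis)
for row in matrix:
    print([round(value, 3) for value in row])
```

Samples 1 and 3 share no taxa, so their dissimilarity is 1. Any other measure with the same two-argument signature could be passed to `dissimilarity_matrix` instead.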
visualization through
hierarchical clustering
non-metric multidimensional scaling (NMDS)
principal component or coordinate analysis (PCA or PCoA)
PCA can only represent Euclidean distances between samples; PCoA can use other distances (see the other dissimilarity measures above)
differentially abundant features (e.g.: taxa, genes, functions)
Microbiome-wide association studies to identify relationships between microbiome features
Statistical testing can reveal differentially abundant features (potential biomarkers) between groups of samples
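The notes do not name a specific test. As one possible illustration, the sketch below runs a permutation test on the difference in mean abundance of a single feature between two groups. The function name, group names, and abundances are all assumptions made for the example.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign samples to the two groups
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    # Add 1 to numerator and denominator so the p-value is never exactly 0
    return (hits + 1) / (n_permutations + 1)


# Hypothetical relative abundances of one taxon in two groups of samples
group_1 = [0.12, 0.15, 0.11, 0.14, 0.13]
group_2 = [0.02, 0.05, 0.03, 0.04, 0.01]
p_value = permutation_test(group_1, group_2)
print(p_value)
```

When many features are tested this way, the resulting p-values would normally need a multiple-testing correction before any feature is called a potential biomarker.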
classification/regression by supervised machine learning (predictive modeling)
Classification (binary or multiclass)
Decision trees
Random Forests
Bootstrap the data: randomly draw j samples with replacement
Generate decision trees from the bootstrapped data; at each split, consider only a random subset of i features
Evaluate the random forest on the samples not in the bootstrap data ("out-of-bag samples") -> count whether each classification is correct -> confusion matrix
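The sampling steps above can be sketched as follows. This covers only the bootstrap draw, the out-of-bag set, and the random feature subset, not tree building. It assumes the bootstrap draws as many samples as the data set contains, which is a common choice; the notes only say j samples. All names and sizes are hypothetical.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples sample indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, subset_size, rng):
    """Pick subset_size distinct feature indices to consider at one split."""
    return rng.sample(range(n_features), subset_size)


rng = random.Random(42)
n_samples, n_features = 10, 6  # hypothetical data set size

in_bag = bootstrap_indices(n_samples, rng)
# Samples never drawn into the bootstrap are the out-of-bag samples
out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
candidate_features = random_feature_subset(n_features, 2, rng)

print("in-bag sample indices:    ", in_bag)
print("out-of-bag sample indices:", out_of_bag)
print("features tried at a split:", candidate_features)
```

Each tree in the forest would get its own bootstrap draw, and each tree's out-of-bag samples can be used to check its predictions.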
ROC: sensitivity vs. specificity trade-off across classification thresholds
AUC: 0.5 = no better than random, 1 = perfect classifier
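To make the link between confusion matrix, ROC, and AUC concrete, here is a minimal sketch. It assumes binary labels (1 = positive class) and a classifier that outputs a score per sample. The function names, labels, and scores are made up.

```python
def confusion_counts(labels, scores, threshold):
    """Return (TP, FP, TN, FN) when score >= threshold predicts positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def roc_points(labels, scores):
    """(1 - specificity, sensitivity) for each threshold, from strict to lenient."""
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp, fp, tn, fn = confusion_counts(labels, scores, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and classifier scores for six samples
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(auc(roc_points(labels, scores)))  # 8/9, about 0.889
```

In this example 8 of the 9 positive/negative pairs are ranked correctly, which is why the AUC is 8/9.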
Feature × sample matrix (n_i × m_j)
For each feature x_i, apply a binary label
Calculate the purity of each resulting subset
The feature that maximizes the purity of the subsets becomes the new node (test)
Repeat 1 - 3 for each node
If a subset is pure (or less pure than the previous subset), the node becomes a leaf (classification); otherwise repeat 4 - 5
Gini impurity: 1 - P(h)^2 - P(d)^2
-> then take the mean over the subsets, weighted by subset size
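A minimal sketch of the impurity calculation and the choice of the splitting feature. It assumes that h and d are two class labels (for example healthy and diseased, which the notes do not spell out) and that each feature is a binary presence/absence value. The taxon names and data are hypothetical.

```python
def gini_impurity(labels):
    """1 - P(h)^2 - P(d)^2 for a list of 'h' / 'd' labels."""
    if not labels:
        return 0.0
    p_h = labels.count("h") / len(labels)
    p_d = labels.count("d") / len(labels)
    return 1.0 - p_h ** 2 - p_d ** 2


def weighted_gini(feature_values, labels):
    """Mean impurity of the two subsets made by a binary feature, weighted by size."""
    present = [lab for val, lab in zip(feature_values, labels) if val]
    absent = [lab for val, lab in zip(feature_values, labels) if not val]
    n = len(labels)
    return (len(present) / n) * gini_impurity(present) + (
        len(absent) / n
    ) * gini_impurity(absent)


def best_feature(features, labels):
    """Name of the feature whose split gives the lowest weighted impurity."""
    return min(features, key=lambda name: weighted_gini(features[name], labels))


# Hypothetical presence/absence of two taxa in six samples
labels = ["h", "h", "h", "d", "d", "d"]
features = {
    "taxon_A": [1, 1, 1, 0, 0, 0],  # separates the two classes perfectly
    "taxon_B": [1, 0, 1, 0, 1, 0],  # only weakly informative
}
for name, values in features.items():
    print(name, round(weighted_gini(values, labels), 3))
print(best_feature(features, labels))  # taxon_A
```

Minimizing the weighted Gini impurity is the same as maximizing the purity of the subsets, so `taxon_A` would become the node here.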
Single decision trees do not generalize well (they tend to overfit the training data)
-> generate random forests
ROC curves
Regression
Learning goals:
You can quantify differences between microbial community compositions
You can test differentially abundant features between different groups of samples
You can argue for and against different measures to account for the multidimensional nature of metagenomic data
You can explain the basic principles of decision trees and random forests
You can apply machine learning for binary and/or multi-factorial classifications of samples