Lecture 6: Reconstruction and annotation of microbial community genomes
Learning goals
understand the differences between assembling individual genomes and assembling metagenomes, and explain the additional challenges of the latter
different abundance -> different levels of sequencing coverage -> need for depth normalization
Due to the uneven (and unknown) representation of the different organisms within a metagenomic mixture, simple coverage statistics can no longer be used to detect repetitive DNA segments in an individual's genome
unrelated genomes may contain nearly-identical DNA (inter-genomic repeats)
multiple individuals of the same species may harbor small genetic differences (strain variants)
explain the differences between composition- and taxonomy-guided methods of binning metagenomic assemblies into metagenome-assembled genomes (MAGs)
know how to assess the quality of assembled metagenomes and MAGs
describe the process of annotating microbial genomes
assembly and reconstruction of individual genomes and metagenomes
sample DNA - shotgun sequencing -> metagenomic reads - assembly -> contigs - scaffolding -> scaffolds - binning -> MAGs
Step 1: data quality control, e.g.: phred scores
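A Phred score Q encodes the estimated base-call error probability P as Q = -10 * log10(P). The following minimal sketch decodes a FASTQ quality string and converts scores to error probabilities. The function names are illustrative, and the Phred+33 ASCII offset is an assumption about the input encoding.

```python
def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality(qual_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding (offset=33); other encodings need another offset.
    """
    return [ord(ch) - offset for ch in qual_string]


def mean_error_prob(qual_string, offset=33):
    """Average per-base error probability of one read."""
    scores = decode_quality(qual_string, offset)
    return sum(phred_to_error_prob(q) for q in scores) / len(scores)


# Toy quality string: 'I' is Q40, '5' is Q20, '+' is Q10, '!' is Q0
scores = decode_quality("II5+!")
```

A read filter could, for example, discard reads whose mean error probability exceeds a chosen threshold; the threshold itself is a design choice and not given in these notes.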
Step 2: assembly and scaffolding, e.g.: k-mer based graph construction and simplification
assembly
1) build a k-mer-based graph in which nodes are k-mers and edges are overlaps of k-1 bases
2) lay out an Eulerian path, i.e. a path that visits each edge exactly once
3) assemble the reads corresponding to the edges traversed on the Eulerian path
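The three assembly steps above can be illustrated with a toy sketch. It assumes error-free reads, a single strand, and a graph that actually contains an Eulerian path; edge multiplicities are ignored. The function names are illustrative, and the path is found with Hierholzer's algorithm, which is one standard choice and not necessarily what a given assembler uses.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Nodes are k-mers; an edge joins two k-mers that overlap by k-1
    bases and occur next to each other in at least one read."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    graph = defaultdict(list)
    for src, dst in sorted(edges):
        graph[src].append(dst)
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    nodes = set(graph) | set(in_deg)
    # Start where out-degree exceeds in-degree by one, else at any node
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg[n] == 1),
        min(graph),
    )
    remaining = {n: list(v) for n, v in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path):
    """Merge consecutive k-mers on the path into one contig sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
contig = spell(eulerian_path(build_graph(reads, k=3)))
print(contig)  # ATGGCGTGCA
```

Real assemblers additionally simplify the graph (removing tips and bubbles caused by sequencing errors and variants) before extracting contigs.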
scaffolding
1) paired-end reads are aligned to contigs and orientations are determined
2) insert size determined and contig connectivity graph constructed
assembly quality often determined with N50 metric:
1) sort contigs by descending size
2) calculate total length of all contigs
3) add up lengths in descending order until sum >= 50% of total length
--> N50 = length of the last contig added; the larger, the better
N50 ignores the true length of the genome: the total is estimated by summing the lengths of all contigs, which may be wrong
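The N50 procedure described above can be written as a short function. This is a minimal sketch; the function name and the example contig lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)  # 1) sort descending
    half = sum(lengths) / 2                         # 2) total length, halved
    running = 0
    for length in lengths:                          # 3) accumulate
        running += length
        if running >= half:
            return length  # length of the last contig added
    raise ValueError("no contigs given")


# Total = 290, half = 145; 80 + 70 = 150 >= 145, so N50 = 70
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

Because the 50% threshold is computed from the contigs themselves, an assembly that is missing large parts of the genome can still report a high N50.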
Step 3a: binning contigs/scaffolds - composition guided
1) for each contig collect read abundance and k-mer frequencies
2) combine both in a distance matrix and resolve into clusters of highly correlated contigs
abundance correlation
across samples
means contigs likely to originate from same genome
co-abundant contigs
within a sample
likely to stem from the same genome because they are amplified at the same rate
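A rough sketch of how composition and abundance could be combined into a single contig-to-contig distance is shown below. Everything specific here is an assumption made for illustration: tetranucleotide (k=4) frequencies, Euclidean distance for composition, Pearson correlation of coverage across samples for abundance, and an equal weighting of the two. Actual binning tools use their own, more elaborate models.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=4):
    """Relative frequency of every possible k-mer in a contig."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skips k-mers with ambiguous bases such as N
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[m] / total for m in kmers]


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation of two coverage vectors (one value per sample)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b) if sd_a and sd_b else 0.0


def contig_distance(c1, c2, k=4, weight=0.5):
    """Weighted mix of composition distance and abundance dissimilarity."""
    comp = euclidean(kmer_profile(c1["seq"], k), kmer_profile(c2["seq"], k))
    abund = (1 - pearson(c1["cov"], c2["cov"])) / 2  # 0 = perfectly correlated
    return weight * comp + (1 - weight) * abund


# Hypothetical contigs: sequence plus read coverage in three samples
a = {"seq": "ATATATATATAT", "cov": [10, 20, 5]}
b = {"seq": "TATATATATATA", "cov": [12, 22, 6]}
c = {"seq": "GCGCGCGGCCGC", "cov": [30, 2, 40]}
d_ab, d_ac = contig_distance(a, b), contig_distance(a, c)
```

In this toy data, contigs a and b end up much closer to each other than a and c, so a clustering step on the resulting distance matrix would place a and b in the same bin.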
Example exam question: How does horizontal gene transfer change step 3a?
Step 3b: binning contigs/scaffolds – taxonomy guided (based on reference genomes)
use of clade-specific marker genes for binning metagenomic assemblies into draft genomes
Binning (metagenomics): the process of classifying reads or contigs into different groups or taxa.
taxonomic and functional annotation of metagenomes
Step 4: genome annotation - protein coding genes
ab-initio: based only on genomic DNA sequence
search for large ORFs
search for signifiers (specific sequence, codons, GC content)
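The ab-initio idea of scanning for long ORFs can be sketched as follows. This is a deliberately simplified illustration: it only searches the forward strand, assumes ATG as the sole start codon and the three standard stop codons, and uses an arbitrary minimum length. Real gene finders also use the reverse strand, alternative start codons, and statistical models of codon usage and GC content.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) positions of forward-strand ORFs.

    'end' is exclusive and includes the stop codon; min_codons counts
    codons from the start codon up to, but not including, the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


def gc_content(seq):
    """Fraction of G and C bases, one of the sequence signals mentioned above."""
    return sum(base in "GC" for base in seq) / len(seq)


seq = "CCATGAAACCCGGGTAGCC"
orfs = find_orfs(seq)
print(orfs)  # [(2, 17)]
```

Here `seq[2:17]` is `ATGAAACCCGGGTAG`, i.e. a start codon, three sense codons, and a stop codon.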
homology-based
core- vs pangenome
Step 5: functional annotation of genes (assign function to gene)
different algorithms, databases (e.g.: COG, KEGG)
Summary
Reconstruction of microbial genomes from metagenomes is challenging as natural communities are complex (many co-existing strains, uneven distribution of abundance)
Assembled contigs/scaffolds can be binned by composition and/or taxonomy-based approaches
Assembly metrics provide a means of quality control, and lineage-specific genes can reveal contamination in metagenome-assembled genomes
Functional annotation of genes/genomes can involve several search strategies against many different databases
Strain level differences between genomes of the same species are reflected in core and pan-genomes