topic 1: the dynamics of microbial genomes within populations
the evolution and diversity of bacterial genomes
PHYLOGENETIC TREE
they lie! they are a statistical representation of an evolutionary model
early on in evolution there was lots of gene swapping between bacterial, archaean and eukaryotic branches -> destroys a lot of evolutionary assumptions
root in centre = LUCA (last universal common ancestor)
trees are 3D not 2D - you can spin them around
no information in the final order of the taxa - focus on the lengths of the branches
DIVERSITY
somewhere between 0.8-1.6 million bacterial "species" but only 20,000 are named
an analysis of 1.7 billion 16S ribosomal RNA amplicon sequences in the V4 hypervariable region, obtained from 492 studies worldwide
recovered 739,880 prokaryotic operational taxonomic units (OTUs - 16S V4 gene clusters at 97% similarity - a commonly used measure of microbial richness)
bacteria have very different 'species' and barriers -> get lots of flow of genes between 'species' so OTU is more relevant
estimated that there exist globally about 0.8-1.6 million prokaryotic OTUs, of which somewhere between 47%-96% were recovered
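a toy sketch of what "clusters at 97% similarity" means, assuming the sequences are already aligned to the same length. The function names (identity, greedy_otus) and the greedy first-match strategy are illustrative assumptions, not the method used in the study above.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold."""
    centroids = []  # one representative sequence per OTU
    clusters = []   # members of each OTU, in the same order as centroids
    for seq in seqs:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                clusters[i].append(seq)
                break
        else:
            # no existing centroid is similar enough: start a new OTU
            centroids.append(seq)
            clusters.append([seq])
    return clusters


# toy 100 bp "amplicons" (made up for illustration)
base = "ACGT" * 25
near = "T" + base[1:]        # 1 mismatch  -> 99% identity to base
far = "T" * 10 + base[10:]   # 8 mismatches -> 92% identity to base
print(len(greedy_otus([base, near, far])))
```

the result of greedy clustering depends on input order, which is one reason OTU counts are treated as an operational measure of richness rather than a count of species.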
gram positive vs. gram negative
basic structural difference that has great taxonomic bearing
gram positive have thick layer of peptidoglycan
can have high GC gram pos and low GC gram pos
MRSA superbug
anthrax
gram negative have a thin peptidoglycan layer and an outer membrane shielding the peptidoglycan
synteny -> gene order
commonly under natural selection
where genes are in the chromosome determines how much they are expressed
proteobacteria = many important pathogens
gave rise to mitochondria
cyanobacteria = C, O, N cycles
gave rise to chloroplasts
HOW DO TWO BACTERIAL GENOMES VARY?
within species
arrangements
indels
"indel" is a portmanteau of insertion and deletion, used because polarisation (knowing the ancestral state) is required to tell the two apart
mutation rate
hypermutation
number of plasmids
DNA modification
allelic diversity
looking at orthologs and counting different alleles
relevant for both
genome size
gene content diversity (MGEs)
gene order (rearrangements)
rates of recombination (population structure)
lots of different ways to answer this question
it depends on the phylogenetic scale of the comparison
within species you can see more types of variation than between more diverse genomes
between-species
genome "shape" (circular or linear)
number of chromosomes
GC content
FEATURES OF BACTERIAL GENOMES
GENOME SIZE
some genera have acquired a "megaplasmid" which has "imported" essential genes from the chromosome
gene transfer from chromosome to plasmid means cell is now dependent on plasmid for survival
500kb
some species have a linear chromosome rather than a circular one eg. the spirochaete Borrelia burgdorferi
species also contains multiple plasmids
Lyme disease
CODING DENSITY
proportion of the genome that is protein coding
GC CONTENT
refers to the proportion of bases that are either G or C ie. (G+C) / (G+C+T+A)
soil dwelling bacteria have high gc content - also have big genomes which makes sense as soil environments are very variable and complex
medium GC content tend to be free-living proteobacteria
endosymbiotic proteobacteria have very low GC content
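a minimal sketch of the (G+C) / (G+C+T+A) definition above. The function name gc_content and the toy sequence are illustrative assumptions, and any symbol other than A, C, G or T (eg. N) is simply ignored.

```python
def gc_content(seq: str) -> float:
    """Return (G+C) / (G+C+T+A) for a DNA sequence, ignoring other symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return gc / total


# toy sequence (made up): 5 of 10 bases are G or C
print(gc_content("ATGCGCGTTA"))
```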
what causes the variation in GC content ("base composition")
selections view - the GC content of the genome reflects adaptation (eg. thermal stability of DNA, or prevention of T-T dimers due to UV)
mutations view - the GC content of the genome has no adaptive relevance but reflects underlying mutational processes
gc content is determined mostly by mutation bias rather than by selection
not all mutations are equally likely eg. C->T mutations occur more frequently than any other mutation so mutation bias alone cannot explain GC content variation
mutation bias
as there are four different bases, there are 12 possible types of mutation (as each base can change to one of the three other bases)
transitions: purine to purine or pyrimidine to pyrimidine mutations (G<->A and T<->C)
transversions (all the rest) - purine to pyrimidine or vice versa
transitions are usually 3-4 fold more common than transversions, although there are twice as many possible transversions
cytosine can rapidly undergo deamination to produce uracil
uracil will pair with A during replication, leading to a C -> T transition
most common type of spontaneous mutation, although most mutations will be repaired by the enzyme uracil-DNA glycosylase
G->A and C->T are proportionally much, much more common - both push GC content down
by applying what we know about mutation bias and pressures - seems to be pressure to lower the GC content
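the counting behind the mutation bias notes above (12 possible changes, transitions vs transversions) can be sketched as follows. The names mutation_class and gc_lowering are illustrative assumptions.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def mutation_class(ref: str, alt: str) -> str:
    """Classify a single-base change as a transition or a transversion."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    # a transition stays within purines or within pyrimidines
    same_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


# all ordered (ref, alt) pairs of different bases
all_changes = list(permutations("ACGT", 2))
transitions = [c for c in all_changes if mutation_class(*c) == "transition"]
transversions = [c for c in all_changes if mutation_class(*c) == "transversion"]
# changes that replace a G or C with an A or T, ie. lower the GC content
gc_lowering = [(r, a) for r, a in all_changes if r in "GC" and a in "AT"]

print(len(all_changes), len(transitions), len(transversions))
```

both of the common transitions named above (G->A and C->T) appear in gc_lowering, which is the sense in which mutation pressure pushes GC content down.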
GC SKEW
proportional difference between G and C on a single strand
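a minimal sketch of GC skew using the usual formula (G - C) / (G + C), computed on one strand. The window size and function names are illustrative assumptions.

```python
def gc_skew(seq: str) -> float:
    """Return (G - C) / (G + C) for one strand; 0.0 if there is no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


def windowed_gc_skew(seq: str, window: int):
    """GC skew in consecutive non-overlapping windows along the sequence."""
    return [
        gc_skew(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


# toy sequence (made up): G-rich first half, C-rich second half
print(windowed_gc_skew("GGGGCCCC", 4))
```

plotting the windowed values along a chromosome is a common way to see where the sign of the skew changes.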
STRANDEDNESS
two strands of DNA - sense and antisense, leading and lagging, Watson and Crick
sense always has genes
leading and lagging -> type of replication
DNA polymerase can only go in one direction
5' -> 3' polymerase
to go along the lagging strand -> discontinuous replication
differences in how the two strands are replicated account for funny patterns of symmetry
genes are preferentially oriented on the leading strand
GENERATION OF DIVERSITY
FATE OF DIVERSITY
selection (natural selection)
directional
purifying
balancing
diversifying (negative frequency-dependent selection)
drift (completely stochastic force)
selective coefficient (s)
effective population size (Ne)
mutations
SNPs
INDELS
rearrangements
recombination
homologous (allelic)
non-homologous (HGT)
the genomics revolution
considerably fewer studies about archaea, as they are rarely, if ever, definitively pathogenic
motivation for genomic sequencing tends to be public health
technological advances have resulted in a genomics revolution
full genome data have revealed massive amounts of variation and new genes and have provided evidence as to the processes which cause genomes to change over evolutionary time
data from complete genomes can provide detailed comparisons both between and within species
for many species (key pathogens) there are thousands of genome sequences available in public databases for population studies
electrophoretic variation
first method to actually allow study of bacterial morphs and phylogeny
there's so much variation in bacterial genetics that it's very hard to explain and keep up with
lots of characteristics of bacterial genomics and epidemiology cut to the core of this
horizontal gene transfer
salient example of how phylogenetic trees lie to you
Helicobacter pylori
lives in stomach and tolerant of low pH
can be pathogenic
recombines A LOT - difficult to manage taxonomically
GENE CONTENT VARIATION
the core genome
all genomes within a species share this genomic material regardless of strain
all basic cellular functions
transcription, translation (informational genes)
cell envelope
key metabolic pathways (housekeeping genes)
DNA replication
DEFINED AS ALL THOSE GENES WHICH ARE PRESENT IN ALL STRAINS AT A GIVEN PHYLOGENETIC DISTANCE (species or genus)
the accessory genome
the unique portion for a specific strain
denotes wider function -> genes brought in by phages, plasmids etc.
typically used for interactions with outside environments
restriction/modification
restriction enzymes chop up DNA, modifications make sure the enzyme doesn't work on its own DNA
rudimentary immune system in bacteria
commonly carried on plasmids
pathogenicity
antibiotic resistance
secretion
secondary metabolism
GENES ARE PRESENT IN SOME STRAINS AND ABSENT IN OTHERS
tend to have a lower GC content than core genes, and are more often of unknown function
the pan-genome
all genomic material for all strains within a species including crossover and unique material
TOTAL POOL OF GENES
multiple genomes within a variety of strains
genes present in only one strain are known as ORFans
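the core / accessory / pan-genome definitions above map directly onto set operations. A minimal sketch, assuming each strain is represented as a set of gene names; the strain names and gene assignments below are invented for illustration.

```python
from collections import Counter


def partition_pangenome(strains: dict):
    """Split gene sets into pan-genome, core, accessory and single-strain genes."""
    if not strains:
        raise ValueError("need at least one strain")
    gene_sets = list(strains.values())
    pan = set().union(*gene_sets)            # total pool of genes
    core = set.intersection(*gene_sets)      # present in all strains
    accessory = pan - core                   # present in some, absent in others
    counts = Counter(g for genes in gene_sets for g in genes)
    single_strain = {g for g, n in counts.items() if n == 1}  # ORFans as defined above
    return pan, core, accessory, single_strain


# hypothetical toy data: gene content is made up
strains = {
    "strain_A": {"dnaA", "rpoB", "gyrA", "blaTEM"},
    "strain_B": {"dnaA", "rpoB", "gyrA", "mecA", "tetM"},
    "strain_C": {"dnaA", "rpoB", "gyrA", "tetM"},
}
pan, core, accessory, single_strain = partition_pangenome(strains)
print(sorted(core), sorted(accessory), sorted(single_strain))
```

in practice the hard part is deciding which genes in different strains count as "the same gene" (ortholog clustering), which this sketch takes as given.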
selection, adaptation and genetic drift
TYPES OF NATURAL SELECTION
directional (positive) selection drives the organism incrementally towards a given favoured trait (eg. antibiotic resistance, thermal tolerance)
trait moves in a particular direction
selection for one extreme and against the other extreme
purifying (aka stabilising) selection
negative selection working against mutations that lower fitness
takes out the extremes at both ends of the trait distribution
optimum phenotype central to the distribution, anything on either side has a disadvantage
balancing (aka disruptive) selection
maintains multiple variants at set frequencies in the population
selection to maintain two different fitness peaks - can be one thing or the other but not both
negative frequency-dependent (aka diversifying) selection
in favour of rare variants - may result in oscillating frequencies (eg. coevolutionary dynamics)
as soon as you're rare you have an advantage allowing you to reproduce, which makes you more common, which gives you a disadvantage: antigens have a similar issue
eg. avoidance of the immune response - red queen hypothesis - must keep running to stand still
strict darwinists see evolution being driven by selection either as an engine (the advancement of adaptations) or as a hard brake (the elimination of maladaptation)
nearly-neutralists see selection primarily as a soft brake - leading to filtering out of slightly deleterious mutations over time
not the case that most mutations are neutral but they are nearly neutral
polymorphism vs substitution vs. mutation
substitutions
fixed differences between populations (and thus species)
eg. AAA in E. coli and GGG in Salmonella
same thing conserved in population but different between: evolution of this is a lot slower
polymorphisms are "standing variation" within a population
may be fixed or removed from the population - variation falling within the boundary of a species
mutations refer to all de novo changes - their emergence is independent of the effect of selection
synonymous mutations
on average have weaker selective consequences than non synonymous mutations
often assumed to be selectively neutral - is this really true?
non synonymous mutations
the fate of an individual mutation will typically be determined by a mixture of selection and drift, except in those cases where selection is either exceptionally strong (either positive or negative), or exceptionally weak (approaching neutrality)
THE DISTRIBUTION OF FITNESS EFFECTS
there are four possible outcomes for a de novo mutation
advantageous
only about 1% of mutations are adaptive in the LTEE (long-term evolution experiment; an artificial experiment, so the figure is probably lower in nature)
strongly deleterious
neutral
slightly deleterious
by far the most common - doesn't work quite as well but still functions as needed
"selective sieve" - new mutations -> substitutions in lineage SOMETIMES
more effective with large census population and large selective coefficient
less effective with small census population size and small selective coefficient
the strength of selection acting on an individual mutation is dependent upon:
effective population size (Ne) - selection is stronger in big populations
refers to the number of individuals that contribute variation to future generations
in bacteria differs from the census population size because a single clone of identical cells can expand rapidly eg. over the course of an infection
can be dramatically reduced through population bottlenecks or founder effects
founder effects - diversity arises through a mutation, if it is in a new niche or particularly adaptive then you reset the diversity and suddenly have very little variation in a population
the effective population size determines the strength of selection, because in a small population a mutant with a modest but positive selective co-efficient will stand more chance of drifting to extinction by chance than in a larger population where selection has more time to operate
can also be lowered by adaptive mutations rising to fixation and purging the population diversity
known as a selective sweep and means adaptation can counter-intuitively come with a cost to the population: less diversity if other threats/environmental changes occur - less standing variation than previously
have witnessed hard sweeps in SARS-CoV-2 - as one variant replaces another this results in a loss of viral diversity
still not clear how many of the mutations for SARS-CoV-2 are adaptive
because most mutations are synonymous many commentators felt early on in the pandemic that mutations were nothing to worry about
epidemiologists in particular tend to see neutrality as the default null model to be disproved
keen to point out that mutations can spread just by chance demographic effects that have nothing to do with selection
others have suggested that many mutations are so deleterious that they are never observed and thus emphasised strong purifying selection in SARS-CoV-2 genomes
the selective coefficient (s) - selection is stronger when s is high
when the product of the two (Ne x s) is greater than 1, selection starts to act
Ne x s < 1: mutations are effectively neutral ie. drift predominates, but if Ne x s > 1 then selection is important
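the rule of thumb above (compare the product of Ne and s with 1) can be written as a small helper. Taking the absolute value of s so that deleterious mutations are handled the same way, and treating a product of exactly 1 as neutral, are assumptions of this sketch; the function name is hypothetical.

```python
def selection_regime(ne: float, s: float) -> str:
    """Classify a mutation by comparing Ne * |s| with 1."""
    if ne <= 0:
        raise ValueError("Ne must be positive")
    product = ne * abs(s)
    if product > 1:
        return "selection is important"
    return "effectively neutral (drift predominates)"


# same selective coefficient, very different outcomes (illustrative values)
print(selection_regime(ne=1_000_000, s=1e-4))
print(selection_regime(ne=100, s=1e-4))
```

the same mutation can therefore be visible to selection in a large population and effectively neutral after a bottleneck.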
TESTS FOR SELECTION: THE dN/dS* ratio - statistical methods to gauge selection
non synonymous mutations are most likely to be slightly deleterious because they alter the amino acid sequence of the proteins (but can also occasionally be adaptive)
purifying selection on slightly deleterious non-synonymous changes takes time
more recently emerged mutations are more likely to be non-synonymous and the dN/dS ratio within asexual populations decreases over time as slightly deleterious non-synonymous changes are selected out
N/S - the number of non synonymous mutations over synonymous mutations - can assume synonymous mutations are closer to neutral than non synonymous
synonymous mutations are likely to be approaching neutrality
are these changes at synonymous sites truly neutral?
by definition they don't change the protein sequence
sites at the third codon position that are free to change to any other base without changing the amino acid are called 4 fold degenerate sites eg. GGN for glycine - so these should be typically neutral?
there are no synonymous changes in the middle of the codon
3rd is the most susceptible to selection
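the codon position notes above can be checked by brute force over the standard genetic code. A minimal sketch: the 64-letter amino acid string is the standard code with bases ordered T, C, A, G; changes to or from stop codons are left out of the tally, which is a simplifying assumption.

```python
from itertools import product

BASES = "TCAG"
# standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_changes():
    """Count synonymous / nonsynonymous single-base changes by codon position."""
    counts = {pos: {"synonymous": 0, "nonsynonymous": 0} for pos in range(3)}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # start from sense codons only
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant_aa = CODON_TABLE[codon[:pos] + base + codon[pos + 1:]]
                if mutant_aa == "*":
                    continue  # nonsense changes are not tallied here
                kind = "synonymous" if mutant_aa == aa else "nonsynonymous"
                counts[pos][kind] += 1
    return counts


def fourfold_degenerate(prefix: str) -> bool:
    """True if all four third-position bases give the same amino acid."""
    return len({CODON_TABLE[prefix + b] for b in BASES}) == 1


counts = count_changes()
syn = sum(c["synonymous"] for c in counts.values())
nonsyn = sum(c["nonsynonymous"] for c in counts.values())
print(counts)
print(nonsyn / syn, fourfold_degenerate("GG"))
```

positions are 0-indexed in the output, so position 1 is the middle of the codon; the overall ratio of nonsynonymous to synonymous changes comes out close to 3.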
selection on synonymous sites - codon bias
codon bias refers to the unequal use of synonymous codons, and is thought to result from a combination of mutational biases and selective pressures relating to the speed and accuracy of translation
for each amino acid there exists an optimal codon that is most commonly employed in highly expressed genes and matching the most abundant tRNA
codon bias is much stronger in highly expressed genes than in lowly expressed genes
sometimes there is strong evidence of specific selective effects operating on synonymous sites
a synonymous SNP can increase resistance to antibiotics by changing the expression of a porin gene in Klebsiella pneumoniae
D corrects for randomness
the dN/dS* (Ka/Ks) ratio therefore provides a means to gauge the strength and direction of selection
dN/dS > 1 = positive selection - ns changes are advantageous and selected for
dN/dS < 1 - negative selection - more s than n, selection is weeding out the n - most common
dN/dS = 1 - neutrality
*refers to the per-site rate of non synonymous and synonymous change - it corrects for the fact that, by chance, there would be about 3 times more non synonymous changes than synonymous changes
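a minimal sketch of the per-site correction described above: divide each count of changes by the number of sites of that type before taking the ratio. The function names and the example counts are invented for illustration, and real estimators also correct for multiple substitutions at the same site, which is omitted here.

```python
def dn_ds(n_changes: int, s_changes: int, n_sites: float, s_sites: float) -> float:
    """Per-site nonsynonymous rate divided by per-site synonymous rate."""
    if n_sites <= 0 or s_sites <= 0:
        raise ValueError("site counts must be positive")
    if s_changes == 0:
        raise ValueError("dN/dS is undefined with no synonymous changes")
    dn = n_changes / n_sites  # nonsynonymous changes per nonsynonymous site
    ds = s_changes / s_sites  # synonymous changes per synonymous site
    return dn / ds


def interpret(ratio: float) -> str:
    """Map a dN/dS value onto the three cases listed above."""
    if ratio > 1:
        return "positive selection"
    if ratio < 1:
        return "negative (purifying) selection"
    return "neutrality"


# made-up example: equal raw counts, but 3x more nonsynonymous sites
ratio = dn_ds(n_changes=30, s_changes=30, n_sites=750, s_sites=250)
print(ratio, interpret(ratio))
```

with equal raw counts the uncorrected N/S ratio would be 1, but the per-site ratio is well below 1, which is why the correction matters.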
THE NEARLY NEUTRAL THEORY PREDICTS dN/dS SHOULD BE HIGHER FOR SMALL POPULATIONS, WHERE SELECTION IS LESS EFFICIENT
Johnson and Seger 2001
compared the numbers of synonymous and non synonymous changes in two mt genes within nine closely related island and mainland lineages of ducks and doves
in all nine cases more non synonymous changes were found in the island species with no bias in the number of synonymous changes
a big population gives a lower dN/dS, as selection is more efficient than in a smaller island population (where dN/dS is closer to 1)
is there selection on intergenic sites?
approximately 10-15% of sites within bacterial genomes correspond to non protein-coding intergenic regions (IGRs)
in the past these sequences may have been referred to as junk aka we don't know what it does, but IGRs are now known to contain many regulatory elements
promoters
terminators
ribosome binding sites
small RNAs
there is a problem in gauging the strength of selective constraint operating on intergenic variation as compared to coding regions - you can't use dN/dS ratios
strength of purifying selection on intergenic sites falls midway between that on synonymous and non-synonymous sites - the RAREST mutations are likely to be under the strongest negative selection
the use of site frequency spectra (SFS)
the frequency spectra of individual mutations (that is how many are very rare, common or intermediate) tell us something about how selection is operating
the nearly neutral theory predicts an excess of rare variation, and that the most deleterious variants will be the most rare
this forms the basis of Tajima's D test
what proportion of mutations fall into the most rare class
ie. they are only observed once in a dataset of hundreds of genomes
approach has been used to detect negative (purifying) selection on intergenic sites
deleterious SNPs under negative (purifying) selection are more likely to be rare in the population (as they will be purged quickly)
a high proportion of singleton SNPs (those present in only one genome = very rare) thus indicates stronger purifying selection
Thorpe et al. compared the proportion of mutations belonging to this most rare class in all six species for
nonsense
nonsynonymous
synonymous
intergenic SNPs
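the singleton comparison described above can be sketched from a SNP presence/absence matrix (one row per SNP, one 0/1 entry per genome). Using the minor allele count (a "folded" spectrum) is an assumption of this sketch, and the toy matrices are invented, not data from the study.

```python
from collections import Counter


def site_frequency_spectrum(snp_matrix):
    """Map minor-allele count -> number of SNPs, from rows of 0/1 genotypes."""
    sfs = Counter()
    for row in snp_matrix:
        ones = sum(row)
        minor = min(ones, len(row) - ones)
        if minor > 0:  # skip sites that are not variable
            sfs[minor] += 1
    return sfs


def singleton_proportion(snp_matrix) -> float:
    """Proportion of variable sites whose rare allele is seen in one genome."""
    sfs = site_frequency_spectrum(snp_matrix)
    total = sum(sfs.values())
    return sfs[1] / total if total else 0.0


# hypothetical toy data: 5 genomes, 3 SNPs per category
toy = {
    "synonymous": [[1, 0, 0, 0, 0], [1, 1, 0, 0, 0], [1, 1, 0, 0, 0]],
    "nonsynonymous": [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [1, 1, 0, 0, 0]],
}
for category, matrix in toy.items():
    print(category, singleton_proportion(matrix))
```

a higher singleton proportion in one category than another is read as stronger purifying selection on that category, following the logic in the notes above.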
bacterial competition and virulence