MLST
MultiLocus Sequence Typing (MLST)
Looking at differences within just one gene may not be sufficient to gather enough features
MLST - considers portions of multiple genes as features to look at relationships within a species and categorise isolates into subspecies
For detailed definition and procedure - Lecture 17, slide 3
The MLST number (sequence type, ST) denotes the combination of alleles found in the genes selected for typing. The set of genes can be different for different kinds of genomes/isolates. E.g. a species can have MLST 56 and MLST 97 types.
More closely related organisms can be differentiated by the same set of genes (same MLST scheme)
The same MLST profile does not necessarily mean that the entire genome sequences are also the same.
Recombination effects must be taken into account. Genes can change through recombination and then become less useful for MLST
Specific parts of each of these genes can vary
Each unique variation is called an allele
These genes are called housekeeping genes (make the bacteria viable). Present in every bacterium - but some bacteria may not have an MLST scheme developed yet.
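A minimal sketch of how typing could work in code, assuming each unique allele gets a number and each unique combination of allele numbers gets an MLST number. The locus names, sequences and profile table below are made up for illustration and do not come from any real scheme.

```python
# Hypothetical three-locus scheme; real schemes typically use more loci.
LOCI = ("geneA", "geneB", "geneC")

# Each unique variation (allele) at a locus gets its own allele number.
ALLELES = {
    "geneA": {"ACGTAC": 1, "ACGTAT": 2},
    "geneB": {"GGCATT": 1, "GGCTTT": 2},
    "geneC": {"TTAACC": 1},
}

# Each unique combination of allele numbers gets an MLST number.
PROFILES = {
    (1, 1, 1): 56,
    (2, 1, 1): 97,
}


def allelic_profile(isolate):
    """Map an isolate's locus sequences to allele numbers (None if unseen)."""
    return tuple(ALLELES[locus].get(isolate[locus]) for locus in LOCI)


def sequence_type(isolate):
    """Return the MLST number for the profile, or None if the profile is new."""
    return PROFILES.get(allelic_profile(isolate))


isolate_1 = {"geneA": "ACGTAC", "geneB": "GGCATT", "geneC": "TTAACC"}
isolate_2 = {"geneA": "ACGTAT", "geneB": "GGCATT", "geneC": "TTAACC"}
print(sequence_type(isolate_1), sequence_type(isolate_2))
```

Two isolates with the same MLST number here can still differ anywhere outside the typed loci, which is why the same profile does not imply the same genome.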
Core Genome
The portions of the genome that are SIMILAR among many isolate sequences (after multiple alignment) - more closely related = larger core genome (E.g. isolates within a species would have a larger core genome than isolates within a genus)
Lecture 17, slide 9-14
Differences within this core genome (variable regions within core genome) are recognised as SNP differences and MLST genes - used to categorise isolates
Building a tree from core variable regions - Lecture 17, slide 18-33
Workflow
Reference genome selection
Has to be a closed/finished genome sequence
Has to be related to the isolates (same species)
Star Alignment - pairwise alignment of isolates to central reference sequence
First sequence the isolates using WGS and then align the short reads of each isolate directly to the reference
Some reads don't map to any region in the reference - belong to isolate but not to reference
Some regions of the reference are not mapped to - belong to reference but not to isolates
Some reads map to multiple regions in the reference - multi-mapping reads
Some reads are mapped and have SNPs - reliable SNP calling only if read depth is sufficient
Don't assemble the reads into a full sequence and then align that sequence with the reference. Instead, directly align the reads to the reference.
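The read depth condition for SNP calling can be sketched as follows. This is a toy illustration, not how any particular variant caller works: the pileup format (reference position mapped to the bases of the reads covering it) and the thresholds `min_depth` and `min_fraction` are assumptions chosen for the example.

```python
from collections import Counter


def call_snps(reference, pileup, min_depth=10, min_fraction=0.9):
    """Call SNPs from a toy pileup.

    pileup maps a 0-based reference position to a string holding the base
    that each mapped read shows at that position.
    """
    snps = {}
    for pos, bases in sorted(pileup.items()):
        depth = len(bases)
        if depth < min_depth:
            continue  # too few reads cover this position to trust a call
        base, count = Counter(bases).most_common(1)[0]
        # Call a SNP only if most reads agree on a non-reference base.
        if base != reference[pos] and count / depth >= min_fraction:
            snps[pos] = (reference[pos], base)
    return snps


reference = "ACGT"
pileup = {
    0: "A" * 12,  # matches the reference, no SNP
    1: "T" * 12,  # well-supported difference from reference base C
    2: "TTT",     # differs from G, but only 3 reads deep
}
print(call_snps(reference, pileup))
```

Position 2 shows why depth matters: the three reads disagree with the reference, but with so little coverage the difference could equally be sequencing error.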
A core SNP tree is constructed based on the SNP distance matrix from the core genome only.
tool: Snippy-core for core genome alignment of variable regions
Basically, the parts where none of the sequences has a gap after alignment
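A small sketch of the idea, assuming the isolates have already been aligned to the reference so that all sequences have equal length. The toy alignment and the function names are hypothetical, and this is not how Snippy-core is implemented; it only shows gap-free columns, the variable sites among them, and the resulting pairwise SNP distances.

```python
# Toy alignment: '-' marks a gap (region missing from that sequence).
ALIGNMENT = {
    "ref":  "ACGTACGT",
    "iso1": "ACGTTCGT",
    "iso2": "AC-TTCGA",
}


def core_columns(alignment):
    """Columns where no sequence has a gap ('-') or an unknown base ('N')."""
    seqs = list(alignment.values())
    return [i for i in range(len(seqs[0])) if all(s[i] not in "-N" for s in seqs)]


def core_snp_sites(alignment):
    """Core columns where at least two sequences differ."""
    seqs = list(alignment.values())
    return [i for i in core_columns(alignment) if len({s[i] for s in seqs}) > 1]


def snp_distance_matrix(alignment):
    """Pairwise count of differing bases over the core SNP sites."""
    sites = core_snp_sites(alignment)
    names = list(alignment)
    return {
        a: {b: sum(alignment[a][i] != alignment[b][i] for i in sites) for b in names}
        for a in names
    }


print(core_snp_sites(ALIGNMENT))
print(snp_distance_matrix(ALIGNMENT))
```

A distance matrix like this could then be passed to a tree-building method such as neighbour joining; that step is left out here.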
Limitation of tree inference method - High resolution comparisons only within a species; performs poorly for genus-level comparisons and beyond.
Pan Genome
= Core + Accessory
All the genetic elements present in the whole species - all variations from the sample are included
Possible to construct a pan-genome tree from the full pan genome (including where some sequences have gaps).
In Eukaryotes, essentially the whole genome is a core genome and there is hardly any accessory genome, because all organisms within a species mostly have the same genes (no extra or missing genes), unless there is some mutation.
Whereas in bacteria, there can be extra plasmids, phages, etc., and hence they have a well-established accessory genome
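Pan = Core + Accessory can be illustrated with set operations on gene presence/absence. This is a sketch under simplifying assumptions: each isolate is reduced to a set of gene names, the names below are illustrative only, and at least one isolate must be supplied.

```python
def partition_genome(gene_content):
    """Split the genes seen across isolates into pan, core and accessory sets.

    gene_content maps an isolate name to the set of genes found in it.
    """
    gene_sets = [set(genes) for genes in gene_content.values()]
    pan = set().union(*gene_sets)            # present in at least one isolate
    core = set.intersection(*gene_sets)      # present in every isolate
    accessory = pan - core                   # present in some but not all
    return pan, core, accessory


isolates = {
    "iso1": {"gyrB", "recA", "plasmid_gene"},
    "iso2": {"gyrB", "recA", "prophage_gene"},
    "iso3": {"gyrB", "recA"},
}
pan, core, accessory = partition_genome(isolates)
print(sorted(core), sorted(accessory))
```

Adding a more distantly related isolate can only shrink the core set or leave it unchanged, which matches the point that more closely related isolates share a larger core genome.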
Accessory Genome
Regions not common to all sequences - transposons, plasmids, prophages, etc.
To identify major genetic subgroups within the species
Tool: mlst