MLST
MultiLocus Sequence Typing (MLST)
Looking at differences within a single gene may not provide enough features to distinguish isolates
MLST - considering portions of multiple genes as features to look at relationships within a species and categorise into subspecies
For detailed definition and procedure - Lecture 17, slide 3
The MLST number (sequence type, ST) denotes the combination of alleles found across the genes selected for typing. It can differ between isolates of the same species. E.g. a species can include both ST 56 and ST 97 isolates.
Closely related organisms are typed with the same set of genes (the same MLST scheme); isolates with an identical allele combination get the same ST number
Recombination effects must be taken into account. Recombination can change a typed gene in a single event, making that gene less useful for MLST
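A minimal sketch of how a typing lookup could work, assuming the usual database convention that each distinct sequence at a locus gets an allele number and each distinct combination of allele numbers gets an ST number. The loci (geneA, geneB, geneC), sequences, and profile table here are invented for illustration; only the ST numbers 56 and 97 echo the example above.

```python
# Toy MLST lookup. All loci, sequences, and profiles are hypothetical.

LOCI = ("geneA", "geneB", "geneC")

# Known alleles per locus: sequence fragment -> allele number
ALLELES = {
    "geneA": {"ACGT": 1, "ACGA": 2},
    "geneB": {"TTGC": 1, "TTGG": 2},
    "geneC": {"GGAT": 1, "GGAC": 2},
}

# Known allelic profiles (in LOCI order) -> sequence type (ST)
PROFILES = {
    (1, 1, 1): 56,
    (2, 1, 2): 97,
}


def allelic_profile(isolate):
    """Map each locus sequence to its allele number (None if the allele is novel)."""
    return tuple(ALLELES[locus].get(isolate[locus]) for locus in LOCI)


def sequence_type(isolate):
    """Return the ST for an isolate, or None if its profile is not in the table."""
    return PROFILES.get(allelic_profile(isolate))


isolate_1 = {"geneA": "ACGT", "geneB": "TTGC", "geneC": "GGAT"}
isolate_2 = {"geneA": "ACGA", "geneB": "TTGC", "geneC": "GGAC"}
isolate_3 = {"geneA": "ACGA", "geneB": "TTGG", "geneC": "GGAC"}

for name, iso in [("isolate_1", isolate_1),
                  ("isolate_2", isolate_2),
                  ("isolate_3", isolate_3)]:
    print(name, allelic_profile(iso), sequence_type(iso))
```

isolate_3 carries known alleles in a combination that is not in the table, so it gets no ST here; a real database would assign a new number to such a profile.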
Core Genome
The portions of the genome that are SIMILAR (shared) among all the isolate sequences after multiple alignment. More closely related = larger core genome (e.g. isolates within a species have a larger core genome than isolates within a genus)
Lecture 17, slide 9-14
Differences within this core genome (variable regions within core genome) are recognised as SNP differences and MLST genes - used to categorise
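The idea of a core genome and its variable regions can be sketched on a toy multiple alignment. This assumes the alignment is already computed and uses "-" for gaps; the isolate names and sequences are made up.

```python
def core_columns(alignment):
    """Indices of alignment columns where no sequence has a gap."""
    seqs = list(alignment.values())
    return [i for i, col in enumerate(zip(*seqs)) if "-" not in col]


def variable_core_sites(alignment):
    """Core columns in which at least two different bases occur."""
    seqs = list(alignment.values())
    return [i for i in core_columns(alignment)
            if len({s[i] for s in seqs}) > 1]


# Hypothetical aligned sequences of equal length
alignment = {
    "iso1": "ACGT-ACGTA",
    "iso2": "ACGTTACCTA",
    "iso3": "ACTTTAC-TA",
}

print("core columns:", core_columns(alignment))
print("variable core sites:", variable_core_sites(alignment))
```

Column 7 differs between iso1 and iso2 but is dropped because iso3 has a gap there, which is why only column 2 counts as a core SNP site.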
Building a tree from core variable regions - Lecture 17, slide 18-33
Workflow
Reference genome selection
Has to be a closed/finished genome sequence
Has to be related to the isolates (same species)
Star Alignment - pairwise alignment of isolates to central reference sequence
First sequence the isolates using WGS and then align the short reads of each isolate directly to the reference
Some reads don't map to any region in the reference - belong to isolate but not to reference
Some regions of the reference are not mapped to - belong to reference but not to isolates
IGNORE
Some reads map to multiple regions in the reference - multi-mapping reads
Some reads are mapped and contain SNPs - SNP calling is reliable only if read depth is sufficient
CONSIDER
Combine the BAM files for each pairwise alignment and consider only the SNP differences in the Core genome
Can include small indels
Infer Tree from this
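The workflow above can be sketched roughly as follows. This is not how any particular tool does it: the pileups are given directly as strings of observed bases per reference position, and the thresholds MIN_DEPTH and MIN_FRACTION are assumed values chosen for illustration.

```python
from collections import Counter

MIN_DEPTH = 10       # assumed minimum read depth for a reliable call
MIN_FRACTION = 0.9   # assumed minimum fraction of reads supporting the base


def call_base(pileup, min_depth=MIN_DEPTH, min_fraction=MIN_FRACTION):
    """Consensus base at one position, or 'N' if the call is unreliable."""
    depth = len(pileup)
    if depth < min_depth:
        return "N"
    base, count = Counter(pileup).most_common(1)[0]
    return base if count / depth >= min_fraction else "N"


def core_snps(reference, pileups_by_isolate):
    """Positions called in every isolate where some isolate differs from the reference."""
    calls = {name: [call_base(p) for p in pileups]
             for name, pileups in pileups_by_isolate.items()}
    snps = {}
    for pos, ref_base in enumerate(reference):
        column = {name: bases[pos] for name, bases in calls.items()}
        if "N" in column.values():
            continue  # position is not reliably covered in all isolates
        if any(base != ref_base for base in column.values()):
            snps[pos] = column
    return snps


# Hypothetical reference and per-position read bases for two isolates
reference = "ACGT"
pileups_by_isolate = {
    "isoA": ["A" * 12, "C" * 12, "T" * 12, "T" * 3],
    "isoB": ["A" * 15, "C" * 10, "G" * 11, "T" * 20],
}

print(core_snps(reference, pileups_by_isolate))
```

Position 3 is excluded because isoA has only 3 reads there, so the only core SNP reported is at position 2.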
Useful in Public Health Microbiology that is interested mostly in only near identical isolates (for disease spreading study, etc.)
Don't assemble the reads into full sequence and then align full sequence with reference. Instead, directly align the reads to the reference.
The core genome is basically the parts where none of the sequences have gaps, post alignment
Limitation of tree inference method - High resolution comparisons only within a species, performs poorly for genus level comparisons and so on.
Pan Genome = Core + Accessory
All the genetic elements present in the whole species - all variations from the sample are included
Accessory Genome
Regions that are not shared by all sequences - transposons, plasmids, prophages, etc.
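The relation Pan = Core + Accessory maps directly onto set operations. A small sketch, assuming each isolate is summarised as a set of gene names; the isolates and the accessory gene names blaX and phage_int are hypothetical.

```python
# Hypothetical gene content per isolate
gene_content = {
    "iso1": {"gyrB", "recA", "rpoB", "blaX"},
    "iso2": {"gyrB", "recA", "rpoB", "phage_int"},
    "iso3": {"gyrB", "recA", "rpoB"},
}

# Pan genome: every gene seen in at least one isolate
pan = set().union(*gene_content.values())

# Core genome: genes present in every isolate
core = set.intersection(*gene_content.values())

# Accessory genome: whatever is in the pan genome but not in the core
accessory = pan - core

print("pan:", sorted(pan))
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```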
Tool: snippy-core for core genome alignment of variable regions
Tool: IQ-TREE
To identify major genetic subgroups within the species
Tool: mlst
Specific parts of each of these genes can vary
These genes are called housekeeping genes (they keep the bacteria viable). Present in every bacterium, but some species may not have an MLST scheme developed yet.
A core SNP tree is constructed based on the SNP distance matrix from the core genome only.
Possible to construct a pan-genome tree from the full pan genome (including where some sequences have gaps).
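A SNP distance matrix of the kind mentioned for the core SNP tree can be sketched as pairwise counts of differing positions. This assumes a gap-free core alignment with all sequences of equal length; the isolate names and sequences are invented, and the tree-building step itself is left out.

```python
from itertools import combinations


def snp_distance(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(core_alignment):
    """Symmetric matrix of pairwise SNP distances as a nested dict."""
    names = sorted(core_alignment)
    matrix = {n: {m: 0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        d = snp_distance(core_alignment[a], core_alignment[b])
        matrix[a][b] = matrix[b][a] = d
    return matrix


# Hypothetical core alignment (variable and invariant sites mixed)
core_alignment = {
    "iso1": "ACGTA",
    "iso2": "ACGTT",
    "iso3": "TCGAT",
}

matrix = distance_matrix(core_alignment)
for name, row in matrix.items():
    print(name, row)
```

Here iso1 and iso2 differ at a single site, so a tree built from these distances would group them together before joining iso3.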
Each unique sequence variant of a gene is called an allele
In eukaryotes, essentially the whole genome is a core genome and there is hardly any accessory genome, because all organisms within a species mostly have the same genes (no extra or missing genes), unless there is some mutation.
Whereas in bacteria, you can have extra plasmids, phages, etc. and hence they have a well established accessory genome
The same MLST profile does not necessarily mean that the entire genome sequences are also the same.