MLST

MultiLocus Sequence Typing (MLST)

Differences within a single gene may not provide enough features to tell isolates apart

MLST - considering portions of multiple genes as features to look at relationships within a species and categorise into subspecies

For detailed definition and procedure - Lecture 17, slide 3

The MLST number (sequence type, ST) denotes the combination of alleles found across the genes selected for typing. It can be different for different isolates. E.g. a species can have MLST 56 and MLST 97 types.

More closely related organisms can be typed with the same set of genes (same MLST scheme); isolates with identical alleles at every gene share the same MLST number
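To make the idea of an MLST number concrete, here is a minimal sketch of looking up a type from the alleles an isolate carries. The gene names, allele numbers and the profile table are all made up for illustration; real schemes are curated in public databases and often use around seven genes.

```python
# Hypothetical 3-gene scheme (names are placeholders, not a real scheme).
SCHEME_GENES = ("geneA", "geneB", "geneC")

# Hypothetical profile table: allele number per gene -> MLST number.
PROFILES = {
    (1, 3, 2): 56,
    (1, 4, 2): 97,
}

def assign_st(alleles):
    """Return the MLST number for a dict of gene -> allele number, or None if unknown."""
    profile = tuple(alleles[gene] for gene in SCHEME_GENES)
    return PROFILES.get(profile)

isolate_1 = {"geneA": 1, "geneB": 3, "geneC": 2}
isolate_2 = {"geneA": 1, "geneB": 4, "geneC": 2}  # differs at geneB only
isolate_3 = {"geneA": 9, "geneB": 9, "geneC": 9}  # profile not in the table

for name, iso in [("isolate_1", isolate_1), ("isolate_2", isolate_2), ("isolate_3", isolate_3)]:
    print(name, assign_st(iso))
```

A single allele difference at one gene is enough to give a different MLST number, and a profile missing from the table has no type yet.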

Recombination effects must be taken into account. Recombination can change a gene so that it is no longer as useful for MLST

Core Genome

The portions of the genome that are SIMILAR among many isolate sequences (after multiple alignment) - more closely related = larger core genome (e.g. isolates within a species have a larger core genome than isolates within a genus)

Lecture 17, slide 9-14

Differences within this core genome (variable regions within core genome) are recognised as SNP differences and MLST genes - used to categorise
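A rough sketch of how variable sites within the core genome could be picked out of a multiple alignment. It assumes the core is approximated as the alignment columns where no sequence has a gap or unknown base; the sequences are made up, and real tools apply further filters.

```python
def core_columns(alignment):
    """Indices of columns where no sequence has a gap ('-') or unknown base ('N')."""
    length = len(next(iter(alignment.values())))
    return [
        i for i in range(length)
        if all(seq[i] not in "-N" for seq in alignment.values())
    ]

def variable_core_sites(alignment):
    """Core columns where at least two different bases are observed (candidate SNPs)."""
    return [
        i for i in core_columns(alignment)
        if len({seq[i] for seq in alignment.values()}) > 1
    ]

# Toy alignment: made-up sequences of equal length.
toy = {
    "ref":  "ACGTACGT",
    "iso1": "ACGTAC-T",
    "iso2": "ACCTACGT",
    "iso3": "ACCTANGT",
}

print(core_columns(toy))         # columns 5 and 6 are dropped (N and gap)
print(variable_core_sites(toy))  # only column 2 varies within the core
```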

Building a tree from core variable regions - Lecture 17, slide 18-33

Workflow

Reference genome selection

Has to be a closed/finished genome sequence

Has to be related to the isolates (same species)

Star Alignment - pairwise alignment of isolates to central reference sequence

First sequence the isolates using WGS and then align the short reads of each isolate directly to the reference

Some reads don't map to any region in the reference - belong to isolate but not to reference

Some regions of the reference are not mapped to - belong to reference but not to isolates

IGNORE

Some reads map to multiple regions in the reference - multi-mapping reads

Some reads map and contain SNPs - SNP calling is reliable only if read depth is sufficient

CONSIDER
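The read-depth condition on SNP calling can be sketched as below. The function name and both thresholds (min_depth, min_fraction) are illustrative assumptions, not values from the lecture or from any particular variant caller.

```python
from collections import Counter

def call_snp(ref_base, read_bases, min_depth=10, min_fraction=0.9):
    """Return the alternate base if a SNP is well supported at this position, else None.

    read_bases holds the base each mapped read shows at the position.
    min_depth and min_fraction are illustrative thresholds only.
    """
    depth = len(read_bases)
    if depth < min_depth:
        return None  # too few reads cover the position to trust a call
    base, count = Counter(read_bases).most_common(1)[0]
    if base != ref_base and count / depth >= min_fraction:
        return base
    return None

print(call_snp("A", "T" * 12))            # deep and consistent: SNP called
print(call_snp("A", "T" * 4))             # too shallow: no call
print(call_snp("A", "T" * 6 + "A" * 6))   # reads disagree: no call
```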

Combine the BAM files for each pairwise alignment and consider only the SNP differences in the Core genome

Can include small indels

Infer Tree from this

Useful in public health microbiology, which is mostly interested in near-identical isolates (e.g. for studying disease spread)

Don't assemble the reads into full sequence and then align full sequence with reference. Instead, directly align the reads to the reference.

Core genome - basically the parts where no sequence has a gap after alignment

Limitation of this tree inference method - gives high-resolution comparisons only within a species; performs poorly for genus-level and broader comparisons.

Pan Genome = Core + Accessory

All the genetic elements present in the whole species - all variations from the sample are included

Accessory Genome

Regions that are not shared by all sequences - transposons, plasmids, prophages, etc.
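The relationship Pan = Core + Accessory can be illustrated with plain set operations on the gene content of each isolate. The isolate and gene names below are hypothetical placeholders.

```python
# Hypothetical gene content per isolate (names are placeholders).
gene_content = {
    "iso1": {"geneA", "geneB", "geneC", "plasmid_gene1"},
    "iso2": {"geneA", "geneB", "geneC", "prophage_gene1"},
    "iso3": {"geneA", "geneB", "geneC"},
}

pan = set.union(*gene_content.values())          # everything seen in any isolate
core = set.intersection(*gene_content.values())  # present in every isolate
accessory = pan - core                           # present in some isolates only

print(sorted(core))
print(sorted(accessory))
```

Adding more isolates to the sample can only keep the core the same or shrink it, while the pan genome can only stay the same or grow.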

Tool: snippy-core - core genome alignment of variable regions

Tool: IQ-TREE

To identify major genetic subgroups within the species

Tool: mlst

Specific parts of each of these genes can vary

These genes are called housekeeping genes (they keep the bacterium viable). They are present in every bacterium, but some species may not have an MLST scheme developed yet.

A core SNP tree is constructed based on the SNP distance matrix from the core genome only.
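A minimal sketch of the SNP distance matrix mentioned above, assuming the input is an alignment of core SNP sites only (equal-length strings, no gaps). The sequences are made up; building the tree itself is left to a dedicated tool.

```python
from itertools import combinations

def snp_distance(a, b):
    """Number of positions at which two equal-length core sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(core_alignment):
    """Symmetric dict-of-dicts of pairwise SNP distances between all sequences."""
    names = list(core_alignment)
    dist = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = snp_distance(core_alignment[n], core_alignment[m])
        dist[n][m] = dist[m][n] = d
    return dist

# Toy core SNP alignment (made-up data).
core_snps = {"ref": "GAT", "iso1": "GAT", "iso2": "CAT", "iso3": "CTT"}
matrix = distance_matrix(core_snps)

for name, row in matrix.items():
    print(name, row)
```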

Possible to construct a pan-genome tree from the full pan genome (including where some sequences have gaps).

Each unique variation is called an allele
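As a small illustration of alleles, the sketch below gives each distinct sequence seen at one gene its own number. Numbering by order of first appearance is an assumption made for the example, and the sequences are made up.

```python
def number_alleles(sequences):
    """Assign 1, 2, 3, ... to each distinct sequence, in order of first appearance."""
    allele_ids = {}
    calls = []
    for seq in sequences:
        if seq not in allele_ids:
            allele_ids[seq] = len(allele_ids) + 1
        calls.append(allele_ids[seq])
    return calls

# Sequences of the same gene fragment from four isolates (made-up data).
fragments = ["ACGT", "ACGA", "ACGT", "TCGT"]
print(number_alleles(fragments))
```

Isolates with identical sequence at the gene get the same allele number; any difference, even a single base, produces a new one.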

In eukaryotes, essentially the whole genome is a core genome and there is hardly any accessory genome, because all organisms within a species mostly have the same genes (no extra or missing genes), unless there is some mutation.

In bacteria, by contrast, isolates can carry extra plasmids, phages, etc., and hence have a well-established accessory genome

The same MLST profile does not necessarily mean that the entire genome sequences are also the same.