WEEK 3: GENETIC VARIATION

Small-scale variation

Consequences of variation

Research

Mid- to large-scale variation

Nature of variation (change to base sequence)

Human populations

Within populations

Between populations

Nucleotide diversity in introns, regulatory sequences, flanking sequences

Comprises 85% of total genetic variation

Frequencies of alleles may vary, esp. for morphological traits

33% of protein-encoding loci are polymorphic

Types

Do not affect DNA content

Affect DNA content

Net loss/gain of DNA sequence

Number of nucleotides unchanged

Multiple nucleotides move location without net loss (rare)

Translocation

Inversion

Change in copy number of sequence (large or small)

Abnormal chromosome segregation

Indel of single NT or short sequence to Mb DNA

DNA variants

Alternative form of DNA produced by mutation

>=0.01 frequency

Polymorphism

<0.01 frequency

Rare
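A minimal Python sketch of the frequency cut-off above. It assumes a variant at exactly 0.01 counts as a polymorphism (the notes do not settle the boundary); the function and constant names are illustrative only.

```python
POLYMORPHISM_THRESHOLD = 0.01  # frequency cut-off used in these notes


def classify_variant(allele_frequency):
    """Label a DNA variant by its population frequency."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    # Assumption: the boundary value 0.01 itself is treated as a polymorphism.
    if allele_frequency >= POLYMORPHISM_THRESHOLD:
        return "polymorphism"
    return "rare variant"


if __name__ == "__main__":
    for freq in (0.2, 0.01, 0.001):
        print(freq, classify_variant(freq))
```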

Venter & Watson diploid genome sequencing compared to reference

3.2M SNPs

290k heterozygous indel variants (1-571 bp)

559k homozygous indel variants (1-82,711 bp)

90 large inversions

62 large-copy-number variants

Total 12M+ nucleotides different (majority non-coding)

44% Venter genes had sequence variant (17% encoded altered protein)

Single nucleotide replaced

Single nucleotide substitution

SNV = single nucleotide variants

2+ DNA variants exceeding frequency of 0.01 in population

SNP = single nucleotide polymorphism (two alleles)

Non-random patterns

Evolutionary ancestry

Different regions undergo different mutation rates

Excess of C>T substitutions (deamination of methylated cytosine)

Mitochondrial DNA > nuclear

Alternative SNPs mark alternative ancestral chromosome segments common in present day population

1.1 x 10^-8 per nucleotide per generation, i.e. about 1 per 100 Mb

Certain NTs polymorphic, others rarely show variants
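Rough arithmetic for the mutation rate quoted above, as a Python sketch. The haploid genome size of about 3.2 x 10^9 bp is an assumption added here, not a figure from these notes.

```python
MUTATION_RATE = 1.1e-8       # per nucleotide per generation (from the notes)
HAPLOID_GENOME_BP = 3.2e9    # ASSUMED approximate haploid human genome size


def expected_new_mutations(sequence_bp, rate=MUTATION_RATE):
    """Expected number of new substitutions in a stretch of DNA per generation."""
    return sequence_bp * rate


if __name__ == "__main__":
    # About 1 new substitution per 100 Mb, matching the note.
    print(expected_new_mutations(100e6))
    # A few dozen per haploid genome under the assumed genome size.
    print(expected_new_mutations(HAPLOID_GENOME_BP))
```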

Indels

Technically should be copy number variants, but modern convention defines as deletions/insertions up to 50 nucleotides

1/10th frequency of single nucleotide substitution

Short insertions more common than long

90% are 1-10 nucleotides

9% are 11-100 nucleotides

1% are greater than 100 nucleotides
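The size classes above can be tallied with a short Python sketch. The list of indel lengths is made up for illustration; only the bin boundaries come from the notes.

```python
from collections import Counter


def indel_size_class(length_nt):
    """Assign an indel length (in nucleotides) to one of the three bins in the notes."""
    if length_nt < 1:
        raise ValueError("indel length must be at least 1 nucleotide")
    if length_nt <= 10:
        return "1-10 nt"
    if length_nt <= 100:
        return "11-100 nt"
    return ">100 nt"


# Hypothetical lengths, not real data.
example_lengths = [1, 2, 3, 1, 7, 10, 4, 2, 45, 300]
tally = Counter(indel_size_class(n) for n in example_lengths)

if __name__ == "__main__":
    for size_class, count in tally.items():
        print(size_class, count)
```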

Repetitive DNA accounts for large fraction of human genome

Tandem copies (1-200bp) are common

Multiple-repeat sections are prone to variation

Satellite DNA

20kb - 100s kb

Centromeres, heterochromatic regions

Minisatellite DNA

100bp - 20kb

Telomeres, subtelomeric regions

Microsatellite DNA

<100bp

Euchromatin

Repeat sequence instability

Variants differ in number of repeats

Copy number variation

Results from replication slippage or unequal crossover

Have multiple alleles (unlike SNPs)

RFLP (restriction fragment length polymorphism) due to gain/loss of a restriction endonuclease (RE) recognition site, caused by a subset of SNPs

Slippage causes insertion when newly synthesised strand loops out

Slippage causes deletion when template strand loops out

Diversity

Meiotic recombination between misaligned repeats changes unit number

Unequal crossover

Misaligned chromatids on homologous chromosomes

Unequal sister chromatid exchange

Misaligned sister chromatids, resulting in two chromatids (one with extra repeat, one with unit missing)
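A simplified Python sketch of how an exchange between misaligned tandem arrays changes repeat number. It treats each chromatid as a bare count of repeat units and the breakpoints as positions between units; that model and all names are assumptions for illustration.

```python
def unequal_exchange(repeats_a, repeats_b, break_a, break_b):
    """Return repeat counts of the two recombinant chromatids.

    break_a / break_b: number of repeat units to the left of the breakpoint
    on chromatid A / chromatid B. Equal values mean the arrays were aligned.
    """
    if not 0 <= break_a <= repeats_a or not 0 <= break_b <= repeats_b:
        raise ValueError("breakpoint must lie within the repeat array")
    # Left part of A joins right part of B, and vice versa.
    product_1 = break_a + (repeats_b - break_b)
    product_2 = break_b + (repeats_a - break_a)
    return product_1, product_2


if __name__ == "__main__":
    # Two 10-unit arrays misaligned by one unit: one product gains a unit,
    # the other loses one. The total number of units is conserved.
    print(unequal_exchange(10, 10, 5, 4))
```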

Markers

More informative than SNPs for distinguishing between individuals or following chromosome segments through pedigree

Early years HGP devoted to defining and mapping microsatellites (~150 000 identified)

Genetic marker of choice since 1990s

Not as easy to automate as SNPs

Origins

Errors in replication or recombination

Unavoidable

Usually quickly corrected by DNA polymerase

Damage and chemical alteration of DNA by endo-/exogenous sources

Errors in chromosome segregation

Abnormal gametes

Fewer/more chromosomes

Structural variation

Balanced

Unbalanced

DNA variants have same DNA content but differ in some DNA sequences located in different positions in genome

DNA variants differ in DNA content: rare case where person has gained/lost chromosomal region often resulting in disease

Chromosomes break and fragments are incorrectly rejoined, without loss or gain of DNA (i.e. inversions/translocations)

Also includes commonly occurring CNV (copy number variants) along moderately to very long DNA sequence, some contributing to disease

= 25% of mutation events, dominated by CNV

1 per 1000bp between maternal/paternal (personal sequencing)

For any SNP locus, many individuals with two haploid genomes will be homozygous

~1/100 NTs (vast majority rare in population)

From population-based genome, ~75% DNA changes are single nucleotide changes (i.e. most common variation type)

Databases

Building variant maps for gene-finding

dbVar

DGV

dbSNP

ALFRED

SNPs and other short genetic variations

Genomic structural variation

Allele frequencies in human populations

Human Genome Project

Good for consensus

Identify genetic variants

Assay genetic variants

Anonymous with respect to traits

Verify polymorphisms, catalogue correlations amongst sites

Not good for individual differences

SNP Discovery

Two phases

Goals

Identify 300,000 SNPs

Phase 1: SNP Discovery

Phase 2: SNP Characterisation

Determine allele frequency of SNPs

Need reference genome to find SNPs: HGP

Projects

HapMap

1000 Genomes

Produce fine-scale genetic map: common resource for biomedical researchers

Genotype 600,000-1,000,000 SNPs genome-wide

Four populations: CEPH (Europe), Yoruba (Africa), Japanese and Chinese (Asia)

Phases

One: 1M common (minor allele freq. >= 0.05) SNPs (every 5kb across genome) genotyped in 269 DNA samples from four populations

Two: Additional 4.6M SNPs genotyped

Phase 1

14 populations: Europe, East Asia, sub-Saharan Africa, America

Genotyping 1092 individuals

Whole-genome (low coverage; 2-6x) and exome sequencing (deep coverage; 50-100x)

Phase 3

Most recent

2535 individuals

26 populations

Exome and whole-genome data

Neutral

Majority neutral effect on phenotype

Many DNA changes no effect (coding, regulatory, non-coding RNA) even within small target of sequences important for gene function

Functional genetic variation (i.e. variants with effect on gene function)

Difficult to estimate how much of genome is functionally important

Extremes

Virtually all amino acids can be replaced while maintaining original function

New function gained

Single mutation may be sufficient

If >1 mutation needed, order of mutation events may be important (many evolutionary failures)

Mutation nomenclature

Base replacement

5162G>A = guanine to adenine at base position 5162

Indel

197delAG

2552insT

Amino acid replacement

R197G = R to G at AA position 197
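A small Python sketch that recognises the four name shapes listed above. It covers only the simplified notation used in these notes, not full HGVS nomenclature; the pattern table and function name are illustrative, and the `>` in a base replacement is treated as optional.

```python
import re

# (kind, pattern) pairs for the simplified notation in these notes.
MUTATION_PATTERNS = [
    ("base replacement", re.compile(r"^(\d+)([ACGT])>?([ACGT])$")),
    ("deletion", re.compile(r"^(\d+)del([ACGT]+)$")),
    ("insertion", re.compile(r"^(\d+)ins([ACGT]+)$")),
    ("amino acid replacement", re.compile(r"^([A-Z])(\d+)([A-Z])$")),
]


def parse_mutation(name):
    """Return (kind, captured fields) for a mutation name, or raise ValueError."""
    for kind, pattern in MUTATION_PATTERNS:
        match = pattern.match(name)
        if match:
            return kind, match.groups()
    raise ValueError("unrecognised mutation name: " + name)


if __name__ == "__main__":
    for example in ("5162G>A", "197delAG", "2552insT", "R197G"):
        print(example, parse_mutation(example))
```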

OMIM

System of cataloging human genes and genetic diseases

ENCODE

Encyclopedia of DNA Elements (2007)

Preceding projects

2003: Human genome complete

2005: Human Epigenome Project (aimed to identify, catalogue, and interpret genome-wide DNA methylation patterns of all human genes in all major tissues)

2006: International Human Epigenome Project (HIEP) (aimed to decipher at least 1000 epigenomes within 7-10 years, and provide high-resolution maps of histone modifications, DNA methylation, transcription start sites, and non-coding RNAs)

Progress

Began as pilot project on 1% of genome

2007: Effort scaled to whole-genome assays followed by expansion to similar assays in mouse

Comprehensive catalogue of gene and functional elements in human and mouse genomes

Measure RNA expression levels

Identify proteins that interact with RNA/DNA e.g. modified histones, transcription factors, RNA-binding proteins

Measure levels of DNA methylation

Identify regions of DNase hypersensitivity

Population genetics

Areas of investigation

Genetic variation within population (genetic composition)

Comparison of populations

Processes that lead to genetic composition changes

Causes of genetic change in populations

New alleles introduced by mutation

Migration changing population composition

Differential reproduction by different genotypes resulting in natural selection

Mating may be random/assortative with in-/outbreeding

Recombination produces new allele combinations

Random fluctuation in reproductive rates resulting in genetic drift in allele frequencies
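Genetic drift from the list above can be illustrated with a Wright-Fisher simulation. That model is a standard idealisation chosen here (the notes do not name one), and the population size, starting frequency, and seed are arbitrary assumptions.

```python
import random


def wright_fisher(p0, pop_size, generations, seed=0):
    """Simulate drift at one biallelic locus in a diploid population.

    Each generation, 2 * pop_size allele copies are drawn at random from the
    previous generation's allele frequency. Returns the frequency trajectory.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    n_copies = 2 * pop_size
    p = p0
    trajectory = [p]
    for _ in range(generations):
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        trajectory.append(p)
    return trajectory


if __name__ == "__main__":
    # Smaller populations fluctuate more from one generation to the next.
    print(wright_fisher(p0=0.5, pop_size=20, generations=10))
```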

Mutation rates

Probability that a copy of an allele changes to another allelic form in one generation

Increase in frequency of a mutant allele = mutation rate x frequency of non-mutant allele
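The relation above can be iterated over generations in a few lines of Python. This sketch assumes one-way mutation only (no back-mutation, selection, or drift); mu and the starting frequency are arbitrary illustrative values.

```python
def mutant_frequency_after(generations, mu, q0=0.0):
    """Frequency of the mutant allele after repeated one-way mutation.

    Each generation: increase in q = mu * (frequency of non-mutant allele).
    """
    q = q0
    for _ in range(generations):
        q += mu * (1.0 - q)
    return q


if __name__ == "__main__":
    # With mu = 1e-5 the mutant allele is still rare after 1000 generations,
    # so mutation alone changes allele frequencies very slowly.
    print(mutant_frequency_after(1000, mu=1e-5))
```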

Mating

Assortative mating (+ or -)

Inbreeding

Whole genome

Trait-specific

Increase in homozygosity

Causes departure from Hardy-Weinberg frequencies

Alleles identical by state (alike in structure/function but not origin)

Alleles identical by descent (copies descended from single allele present in ancestor)

Statistics

Inbreeding coefficient F

Probability two alleles are identical by descent

0 = mating occurs randomly in large population

1 = all alleles identical by descent

Measured by pedigree analysis or reduction of heterozygosity in population
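A Python sketch of how F shifts genotype frequencies away from Hardy-Weinberg, and how F can be estimated from reduced heterozygosity. It assumes one locus with two alleles, A at frequency p and a at frequency 1 - p; the function names are illustrative.

```python
def genotype_frequencies(p, F=0.0):
    """Genotype frequencies at a biallelic locus with inbreeding coefficient F.

    F = 0 gives the Hardy-Weinberg proportions; F = 1 leaves no heterozygotes.
    """
    q = 1.0 - p
    return {
        "AA": p * p + F * p * q,
        "Aa": 2.0 * p * q * (1.0 - F),
        "aa": q * q + F * p * q,
    }


def inbreeding_from_heterozygosity(h_observed, p):
    """Estimate F as the proportional reduction in heterozygosity."""
    h_expected = 2.0 * p * (1.0 - p)
    if h_expected == 0.0:
        raise ValueError("locus is monomorphic; F cannot be estimated")
    return 1.0 - h_observed / h_expected


if __name__ == "__main__":
    print(genotype_frequencies(0.5, F=0.25))
    print(inbreeding_from_heterozygosity(0.375, p=0.5))
```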

Self-fertilisation

Repeated generations of inbreeding splits a heterozygous population into series of completely homozygous lines
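Under complete self-fertilisation, heterozygosity halves each generation, which is why repeated selfing ends in homozygous lines. A minimal Python sketch, assuming strict selfing with no selection or mutation:

```python
def heterozygosity_under_selfing(h0, generations):
    """Expected heterozygosity after a number of generations of selfing."""
    return h0 * 0.5 ** generations


def inbreeding_under_selfing(generations):
    """Inbreeding coefficient F after selfing, starting from F = 0."""
    return 1.0 - 0.5 ** generations


if __name__ == "__main__":
    for t in range(6):
        print(t, heterozygosity_under_selfing(1.0, t), inbreeding_under_selfing(t))
```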

Japanese study 1965

10% increase in F = 6pt drop in IQ

Children of 1st cousins = 40% increased mortality

Consanguineous marriage

1st degree relatives share 1/2 genes

Parents (always)

Full siblings (on average)

2nd degree relatives share 1/4 genes

Grandparents/grandchildren, uncles/aunts, nephews/nieces, half-siblings (on average)

3rd degree relatives share 1/8 genes on average

Coefficient of relationship R

Proportion of alleles shared by two persons due to common genetic descent from one or more recent common ancestors

= 2F
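A worked example of R = 2F in Python, using the halving per degree of relationship listed earlier. It assumes non-inbred ancestors; first cousins are taken as third-degree relatives.

```python
def coefficient_of_relationship(degree):
    """Expected proportion of alleles shared by relatives of a given degree."""
    if degree < 1:
        raise ValueError("degree of relationship must be 1 or greater")
    return 0.5 ** degree


def offspring_inbreeding_coefficient(relationship):
    """Inbreeding coefficient F of a child whose parents have relationship R (R = 2F)."""
    return relationship / 2.0


if __name__ == "__main__":
    r_first_cousins = coefficient_of_relationship(3)    # 1/8
    print(r_first_cousins)
    print(offspring_inbreeding_coefficient(r_first_cousins))  # 1/16
```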

Darwinian Evolution

Principle of heredity

Principle of variation

Principle of selection

Variation exists in morphology, physiology, behaviour among members of population

Offspring resemble parents more than individuals to which they are unrelated

Some variants more successful at surviving and reproducing than other variants in given environment; variants of higher fitness are naturally selected

Sickle cell anaemia

Autosomal recessive disease

AA (normal)

SS

Severe anaemia: Hb crystallises at low O2 levels, causing RBCs to become sickle-shaped and rupture

AS

Mild anaemia; resistance to malaria infection gives higher fitness in malaria areas
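The higher fitness of AS carriers can be sketched with the standard heterozygote-advantage model in Python. The selection coefficients below are hypothetical values chosen for illustration, not measurements: s is the fitness cost to AA (malaria), t the cost to SS (anaemia).

```python
def next_s_frequency(q, s, t):
    """Frequency of the S allele after one generation of selection.

    Assumed relative fitnesses: AA = 1 - s, AS = 1, SS = 1 - t.
    """
    p = 1.0 - q
    mean_fitness = p * p * (1.0 - s) + 2.0 * p * q + q * q * (1.0 - t)
    return (p * q + q * q * (1.0 - t)) / mean_fitness


def equilibrium_s_frequency(s, t):
    """Stable S allele frequency when heterozygotes are fittest."""
    return s / (s + t)


if __name__ == "__main__":
    s, t = 0.1, 0.8  # HYPOTHETICAL selection coefficients
    q = 0.01
    for _ in range(500):
        q = next_s_frequency(q, s, t)
    # The simulated frequency settles at the predicted equilibrium.
    print(q, equilibrium_s_frequency(s, t))
```

In this model the S allele persists despite the severity of SS, as long as AA individuals also pay a fitness cost in the malaria environment.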

Altered environments adaptation

Malaria-infested environment

RBC physiology alterations affecting transmission of P. falciparum or P. vivax and increased resistance to malaria

Lifelong intake of fresh milk

High-altitude (low O2 tension)

High dietary starch

Reduced sunlight (low UV)

Decreased pigmentation allowing more efficient transmission of depleted UV to deep layer of dermis to synthesise Vitamin D

Lowered haemoglobin levels and high density of blood capillaries provide protection against hypoxia

Persistence of lactase production in adults allowing efficient digestion of lactose

Increased production of enzyme needed to digest starch efficiently

SLC24A5 variant replacing ancestral alanine at position 111 by threonine

EPAS1 variants (key gene in hypoxia response)

Pathogenic mutations in HBB or G6PD for P. falciparum; inactivated DARC variants not expressing Duffy antigen in P. vivax malaria

13910T allele about 14kb upstream of lactase gene LCT

High AMY1A copy number

Selective sweep

Variant becomes fixed in a population

MHC polymorphism

Pathogen-driven: strong selection pressure due to emergence of mutant pathogens that seek to evade MHC-mediated detection

Gene duplication: multiple MHC genes with different peptide-binding specificities

Many MHC genes extraordinarily polymorphic; more so than any other proteins

Most polymorphic loci

HLA-A

HLA-B

HLA-C

HLA-DPB1

HLA-DRB1

HLA-DQB1