WEEK 3: GENETIC VARIATION
Small-scale variation
Consequences of variation
Research
Mid- to large-scale variation
Nature of variation (change to base sequence)
Human populations
Within populations
Between populations
Nucleotide diversity in introns, regulatory sequences, flanking sequences
Comprises 85% of total genetic variation
Frequencies of alleles may vary, esp. for morphological traits
33% of protein-encoding loci are polymorphic
Types
Do not affect DNA content: number of nucleotides unchanged
Multiple nucleotides move location without net loss (rare)
Translocation
Inversion
Affect DNA content: net loss/gain of DNA sequence
Change in copy number of sequence (large or small)
Abnormal chromosome segregation
Indel ranging from a single NT or short sequence up to Mb of DNA
DNA variants
Alternative form of DNA produced by mutation
Polymorphism: frequency >= 0.01
Rare: frequency < 0.01
Venter & Watson diploid genome sequencing compared to reference
3.2M SNPs
290k heterozygous indel variants (1-571 bp)
559k homozygous indel variants (1-82,711 bp)
90 large inversions
62 large-copy-number variants
Total 12M+ nucleotides different (majority non-coding)
44% Venter genes had sequence variant (17% encoded altered protein)
Single nucleotide replaced
Single nucleotide substitution
SNV = single nucleotide variants
2+ DNA variants exceeding frequency of 0.01 in population
SNP = single nucleotide polymorphism (two alleles)
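The frequency cut-off that separates a polymorphism from a rare variant can be sketched as a small function. This is only an illustration: the name classify_variant is hypothetical, and treating a frequency of exactly 0.01 as a polymorphism is an assumption.

```python
# Cut-off from the notes: variants at a population frequency of about 0.01
# or more are polymorphisms, anything below is a rare variant.
# Treating exactly 0.01 as a polymorphism is an assumption.
POLYMORPHISM_THRESHOLD = 0.01


def classify_variant(allele_frequency):
    """Label a variant allele as 'polymorphism' or 'rare' from its frequency."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    if allele_frequency >= POLYMORPHISM_THRESHOLD:
        return "polymorphism"
    return "rare"


for freq in (0.25, 0.01, 0.002):
    print(freq, classify_variant(freq))
```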
Non-random patterns
Evolutionary ancestry
Different regions undergo different mutation rates
Excess of C-to-T substitutions (deamination of methylated cytosine)
Mitochondrial DNA > nuclear
Alternative SNPs mark alternative ancestral chromosome segments common in present day population
Mutation rate: 1.1 x 10^-8 per nucleotide per generation, roughly 1 per 100 Mb
Certain NTs polymorphic, others rarely show variants
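The quoted rate of 1.1 x 10^-8 per generation can be checked with a little arithmetic. This sketch assumes the rate is per nucleotide and that the haploid genome is about 3.2 x 10^9 bp. The genome size is not stated in these notes and is an assumption.

```python
MUTATION_RATE = 1.1e-8       # per nucleotide per generation (from the notes)
HAPLOID_GENOME_BP = 3.2e9    # ASSUMED haploid genome size, not from the notes

# How many bases must be copied, on average, for one new mutation to appear?
bases_per_mutation = 1 / MUTATION_RATE

# Expected new mutations in one haploid genome (one gamete) and in the
# diploid genome of a child (two gametes).
new_per_haploid = MUTATION_RATE * HAPLOID_GENOME_BP
new_per_diploid = 2 * new_per_haploid

print(f"one mutation per {bases_per_mutation / 1e6:.0f} Mb")
print(f"{new_per_haploid:.1f} new mutations per haploid genome")
print(f"{new_per_diploid:.1f} new mutations per diploid genome")
```

One mutation per roughly 91 Mb is consistent with the rounded figure of 1 per 100 Mb.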
Indels
Technically should be copy number variants, but modern convention defines as deletions/insertions up to 50 nucleotides
1/10th frequency of single nucleotide substitution
Short insertions more common than long
90% are 1-10 nucleotides
9% are 11-100 nucleotides
1% are greater than 100 nucleotides
Repetitive DNA accounts for large fraction of human genome
Tandem copies (1-200 bp) are common
Multiple-repeat sections are prone to variation
Satellite DNA: 20 kb to 100s of kb; centromeres, heterochromatic regions
Minisatellite DNA: 100 bp to 20 kb; telomeres, subtelomeric regions
Microsatellite DNA: <100 bp; euchromatin
Repeat sequence instability
Variants differ in number of repeats
Copy number variation
Results from replication slippage or unequal crossover
Have multiple alleles (unlike SNPs)
RFLP (restriction fragment length polymorphism): gain/loss of a restriction endonuclease (RE) site, caused by a subset of SNPs
Slippage causes insertion when the newly synthesised strand loops out (backward slippage)
Slippage causes deletion when the template strand loops out (forward slippage)
Diversity
Meiotic recombination between mispaired repeats changes unit number
Unequal crossover: misaligned chromatids on homologous chromosomes
Unequal sister chromatid exchange: misaligned sister chromatids
Both result in two chromatids (one with extra repeat units, one with units missing)
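A toy model shows how an exchange between misaligned repeat arrays gives one longer and one shorter product. The function name, the labelled repeat units and the break positions are all hypothetical, chosen only for illustration.

```python
def unequal_crossover(array_a, array_b, break_a, break_b):
    """Exchange two tandem-repeat arrays at misaligned break points.

    array_a and array_b are lists of repeat units. The break falls after
    break_a units in array_a and after break_b units in array_b. If the
    two positions differ, the arrays were misaligned by that many units.
    """
    recombinant_1 = array_a[:break_a] + array_b[break_b:]
    recombinant_2 = array_b[:break_b] + array_a[break_a:]
    return recombinant_1, recombinant_2


# Two chromatids, each carrying five copies of a hypothetical CA repeat.
# Units are labelled so their origin can be followed.
chromatid_a = [f"CA(a{i})" for i in range(1, 6)]
chromatid_b = [f"CA(b{i})" for i in range(1, 6)]

# Misalignment by one unit: break after unit 3 of A but after unit 2 of B.
longer, shorter = unequal_crossover(chromatid_a, chromatid_b, 3, 2)
print(len(longer), longer)
print(len(shorter), shorter)
```

The total number of units is conserved: one product gains what the other loses.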
Markers
More informative than SNPs for distinguishing between individuals or following chromosome segments through pedigree
Early years HGP devoted to defining and mapping microsatellites (~150 000 identified)
Genetic marker of choice since 1990s
Not as easy to automate as SNPs
Origins
Errors in replication or recombination
Unavoidable
Usually quickly corrected by DNA polymerase
Damage and chemical alteration of DNA by endo-/exogenous sources
Errors in chromosome segregation
Abnormal gametes
Fewer/more chromosomes
Structural variation
Balanced
Unbalanced
DNA variants have same DNA content but differ in some DNA sequences located in different positions in genome
DNA variants differ in DNA content: rare case where person has gained/lost chromosomal region often resulting in disease
Chromosomes break and fragments are incorrectly rejoined, without loss or gain of DNA (i.e. inversions/translocations)
Also includes commonly occurring CNV (copy number variants) along moderately to very long DNA sequence, some contributing to disease
= 25% of mutation events, dominated by CNV
1 per 1000bp between maternal/paternal (personal sequencing)
For any SNP loci, many individuals with two haploid genomes will be homozygous
~1/100 NTs (vast majority rare in population)
From population-based genome, ~75% DNA changes are single nucleotide changes (i.e. most common variation type)
Databases
Building variant maps for gene-finding
dbSNP: SNPs and other short genetic variations
dbVar, DGV: genomic structural variation
ALFRED: allele frequencies in human populations
Human Genome Project
Good for consensus
Identify genetic variants
Assay genetic variants
Anonymous with respect to traits
Verify polymorphisms, catalogue correlations amongst sites
Not good for individual differences
SNP Discovery
Two phases
Goals
Identify 300,000 SNPs
Phase 1: SNP Discovery
Phase 2: SNP Characterisation
Determine allele frequency of SNPs
Need reference genome to find SNPs: HGP
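SNP characterisation comes down to counting alleles in genotyped samples. A minimal sketch, assuming diploid genotypes at a two-allele SNP. The function names and the genotype counts are hypothetical.

```python
def allele_frequencies(n_ref_hom, n_het, n_alt_hom):
    """Return (reference, alternative) allele frequencies from genotype counts."""
    n_individuals = n_ref_hom + n_het + n_alt_hom
    if n_individuals == 0:
        raise ValueError("need at least one genotyped individual")
    total_alleles = 2 * n_individuals          # each individual carries two alleles
    ref_freq = (2 * n_ref_hom + n_het) / total_alleles
    return ref_freq, 1.0 - ref_freq


def minor_allele_frequency(n_ref_hom, n_het, n_alt_hom):
    """Frequency of the less common of the two alleles."""
    return min(allele_frequencies(n_ref_hom, n_het, n_alt_hom))


# Hypothetical genotype counts for one SNP: 200 ref/ref, 60 ref/alt, 9 alt/alt.
ref_freq, alt_freq = allele_frequencies(200, 60, 9)
maf = minor_allele_frequency(200, 60, 9)
print(f"ref = {ref_freq:.3f}, alt = {alt_freq:.3f}, MAF = {maf:.3f}")
```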
Projects
HapMap
1000 Genomes
Produce fine-scale genetic map: common resource for biomedical researchers
Genotype 600,000-1,000,000 SNPs genome-wide
Four populations: CEPH (Europe), Yoruba (Africa), Japanese/Chinese (Asia)
Phases
One: 1M common (minor allele freq. >= 0.05) SNPs (one every 5 kb across genome) genotyped in 269 DNA samples from four populations
Two: additional 4.6M SNPs genotyped
Phase 1
14 populations: Europe, East Asia, sub-Saharan Africa, America
Genotyping 1092 individuals
Whole-genome (low coverage; 2-6x) and exome sequencing (deep coverage; 50-100x)
Phase 3
Most recent
2535 individuals
26 populations
Exome and whole-genome data
Neutral
Majority neutral effect on phenotype
Many DNA changes no effect (coding, regulatory, non-coding RNA) even within small target of sequences important for gene function
Functional genetic variation
(i.e. variants with effect on gene function)
Difficult to estimate how much of genome is functionally important
Extremes
Virtually all amino acids can be replaced while maintaining original function
New function gained
Single mutation may be sufficient
If >1 mutation needed, order of mutation events may be important (many evolutionary failures)
Mutation nomenclature
Base replacement
5162G>A = guanine to adenine at base position 5162
Indel
197delAG
2552insT
Amino acid replacement
R197G = R to G at AA position 197
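The naming patterns above are regular enough to parse with a few regular expressions. This is a sketch covering only the forms listed in these notes, not full HGVS nomenclature. The function name parse_mutation and the returned field names are hypothetical.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # one-letter amino acid codes

# One pattern per naming style in the notes. The ">" in a base replacement
# is treated as optional so that both 5162G>A and 5162GA are accepted.
PATTERNS = [
    ("base replacement",
     re.compile(r"^(?P<pos>\d+)(?P<ref>[ACGT])>?(?P<alt>[ACGT])$")),
    ("deletion", re.compile(r"^(?P<pos>\d+)del(?P<seq>[ACGT]+)$")),
    ("insertion", re.compile(r"^(?P<pos>\d+)ins(?P<seq>[ACGT]+)$")),
    ("amino acid replacement",
     re.compile(rf"^(?P<ref>[{AMINO_ACIDS}])(?P<pos>\d+)(?P<alt>[{AMINO_ACIDS}])$")),
]


def parse_mutation(name):
    """Split a mutation name into its kind, position and sequence fields."""
    for kind, pattern in PATTERNS:
        match = pattern.match(name)
        if match:
            fields = match.groupdict()
            fields["pos"] = int(fields["pos"])
            fields["kind"] = kind
            return fields
    raise ValueError(f"unrecognised mutation name: {name!r}")


for example in ("5162G>A", "197delAG", "2552insT", "R197G"):
    print(example, parse_mutation(example))
```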
OMIM
System of cataloging human genes and genetic diseases
ENCODE
Encyclopedia of DNA Elements (2007)
Preceding projects
2003: Human genome complete
2005: Human Epigenome Project (aimed to identify, catalogue, and interpret genome-wide DNA methylation patterns of all human genes in all major tissues)
2006: International Human Epigenome Project (HIEP) (aimed to decipher at least 1000 epigenomes within 7-10 years, and provide high-resolution maps of histone modifications, DNA methylation, transcription start sites, and non-coding RNAs)
Progress
Began as pilot project on 1% of genome
2007: Effort scaled to whole-genome assays followed by expansion to similar assays in mouse
Comprehensive catalogue of gene and functional elements in human and mouse genomes
Measure RNA expression levels
Identify proteins that interact with RNA/DNA e.g. modified histones, transcription factors, RNA-binding proteins
Measure levels of DNA methylation
Identify regions of DNase I hypersensitivity
Population genetics
Areas of investigation
Genetic variation within population (genetic composition)
Comparison of populations
Processes that lead to genetic composition changes
Causes of genetic change in populations
New alleles introduced by mutation
Migration changing population composition
Differential reproduction by different genotypes resulting in natural selection
Mating may be random/assortative with in-/outbreeding
Recombination produces new allele combinations
Random fluctuation in reproductive rates resulting in genetic drift in allele frequencies
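Genetic drift, the last cause in the list above, is easy to see in a simulation. The sketch below uses a simple Wright-Fisher style model, which is an assumption brought in for illustration and not something these notes specify. The function name, population size and seed are hypothetical.

```python
import random


def simulate_drift(initial_freq, population_size, generations, seed=0):
    """Track one allele's frequency under random sampling alone.

    Each generation, 2 * population_size allele copies are drawn at random
    from the previous generation. No mutation, migration or selection.
    """
    rng = random.Random(seed)            # fixed seed so runs are repeatable
    n_copies = 2 * population_size       # diploid: two copies per individual
    freq = initial_freq
    trajectory = [freq]
    for _ in range(generations):
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        freq = count / n_copies
        trajectory.append(freq)
    return trajectory


small = simulate_drift(0.5, population_size=20, generations=50)
large = simulate_drift(0.5, population_size=2000, generations=50)
print("N = 20:  ", [round(f, 2) for f in small[::10]])
print("N = 2000:", [round(f, 2) for f in large[::10]])
```

Frequencies typically wander much further in the small population, and an allele that reaches 0 or 1 stays there.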
Mutation rates
Probability that a copy of an allele changes to another allelic form in one generation
Increase in frequency of a mutant allele = mutation rate x frequency of non-mutant allele
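The relation above can be applied generation by generation. A minimal sketch, assuming one-way mutation only (no back mutation, selection or drift). The function name and the example rate are hypothetical.

```python
def mutant_frequency_after(q0, mutation_rate, generations):
    """Mutant allele frequency after repeated one-way mutation.

    Each generation the mutant frequency q rises by
    mutation_rate * (1 - q), where 1 - q is the non-mutant frequency.
    """
    q = q0
    for _ in range(generations):
        q += mutation_rate * (1.0 - q)
    return q


MU = 1e-5   # hypothetical per-generation mutation rate for one locus
for t in (1, 1000, 100000):
    print(t, mutant_frequency_after(0.0, MU, t))
```

Mutation alone changes allele frequencies very slowly: starting from zero, the mutant allele is still at only about 1% after 1000 generations at this rate.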
Mating
Assortative mating (+ or -)
Inbreeding
Whole genome
Trait-specific
Increase in homozygosity
Causes departure from Hardy-Weinberg frequencies
Alleles identical by state
(alike in structure/function but not origin)
Alleles identical by descent
(copies descended from single allele present in ancestor)
Statistics
Inbreeding coefficient F
Probability two alleles are identical by descent
0 = mating occurs randomly in large population
1 = all alleles identical by descent
Measured by pedigree analysis or reduction of heterozygosity in population
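The link between F and reduced heterozygosity can be made concrete. The sketch uses the standard two-allele result that inbreeding scales heterozygote frequency by (1 - F) relative to Hardy-Weinberg expectations. The function names and example values are hypothetical.

```python
def genotype_frequencies(p, inbreeding_f=0.0):
    """Genotype frequencies for alleles A (freq p) and a (freq 1 - p).

    With inbreeding_f = 0 these are the Hardy-Weinberg frequencies.
    Inbreeding moves a fraction F of the heterozygotes into the two
    homozygous classes.
    """
    q = 1.0 - p
    return {
        "AA": p * p + inbreeding_f * p * q,
        "Aa": 2 * p * q * (1.0 - inbreeding_f),
        "aa": q * q + inbreeding_f * p * q,
    }


def inbreeding_from_heterozygosity(observed_het, p):
    """Estimate F as the proportional drop in heterozygosity."""
    expected_het = 2 * p * (1.0 - p)
    if expected_het == 0:
        raise ValueError("locus is monomorphic, F cannot be estimated")
    return 1.0 - observed_het / expected_het


random_mating = genotype_frequencies(0.5)
inbred = genotype_frequencies(0.5, inbreeding_f=0.25)
estimated_f = inbreeding_from_heterozygosity(inbred["Aa"], 0.5)
print(random_mating, inbred, estimated_f)
```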
Self-fertilisation
Repeated generations of inbreeding splits a heterozygous population into series of completely homozygous lines
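Under complete self-fertilisation a heterozygote leaves half heterozygous offspring, so heterozygosity halves every generation. A short sketch of that decay follows. The function name and starting value are hypothetical.

```python
def heterozygosity_under_selfing(initial_het, generations):
    """Heterozygosity left after a number of generations of pure selfing."""
    return initial_het * 0.5 ** generations


START = 0.5   # hypothetical starting heterozygosity
for t in range(0, 11, 2):
    print(t, heterozygosity_under_selfing(START, t))
```

After ten generations less than 0.1% of the population is still heterozygous at the locus, which is why repeated selfing yields homozygous lines.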
Japanese study 1965
10% increase in F = 6pt drop in IQ
Children of 1st cousins = 40% increased mortality
Consanguineous marriage
1st degree relatives share 1/2 of genes
Parents (always)
Full siblings (on average)
2nd degree relatives share 1/4 of genes on average
Grandparents/grandchildren, uncles/aunts, nephews/nieces, half-siblings
3rd degree relatives share 1/8 of genes on average
Coefficient of relationship R
Proportion of alleles shared by two persons due to common genetic descent from one or more recent common ancestors
= 2F
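R can be worked out by path counting: each path linking two relatives through a common ancestor contributes (1/2) raised to the number of parent-child steps on the path. The sketch assumes the common ancestors are not themselves inbred. The function names are hypothetical.

```python
from fractions import Fraction


def coefficient_of_relationship(path_lengths):
    """Sum (1/2)^n over all paths, n = parent-child steps on each path."""
    return sum((Fraction(1, 2) ** n for n in path_lengths), Fraction(0))


def inbreeding_coefficient_of_offspring(path_lengths):
    """F for a child of the two relatives, using R = 2F from the notes."""
    return coefficient_of_relationship(path_lengths) / 2


# Path lengths for some common relationships.
relationships = {
    "parent and child": [1],
    "full siblings": [2, 2],        # one path through each shared parent
    "half siblings": [2],
    "first cousins": [4, 4],        # one path through each shared grandparent
}
for name, paths in relationships.items():
    print(name, coefficient_of_relationship(paths),
          inbreeding_coefficient_of_offspring(paths))
```

First cousins come out at R = 1/8, so a child of first cousins has F = 1/16.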
Darwinian Evolution
Principle of variation: variation exists in morphology, physiology, behaviour among members of population
Principle of heredity: offspring resemble parents more than individuals to which they are unrelated
Principle of selection: some variants more successful at surviving and reproducing than other variants in given environment; variants of higher fitness are naturally selected
Sickle cell anaemia
Autosomal recessive disease
AA (normal)
SS
Severe anaemia
Hb crystallises at low O2 levels causing RBCs to become sickle-shaped and rupture
AS
Mild anaemia; resistance to malaria gives higher fitness in malaria areas
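The higher fitness of AS carriers in malaria areas keeps the S allele in the population despite the severity of SS. The sketch below applies one round of viability selection per generation. The fitness values are hypothetical and chosen only for illustration, and the function names are assumptions.

```python
def next_generation_freq(q, w_aa, w_as, w_ss):
    """Frequency of the S allele after one generation of selection.

    q is the current S frequency. w_aa, w_as and w_ss are the relative
    fitnesses of the AA, AS and SS genotypes, with random mating assumed.
    """
    p = 1.0 - q
    mean_fitness = p * p * w_aa + 2 * p * q * w_as + q * q * w_ss
    return (p * q * w_as + q * q * w_ss) / mean_fitness


def run_selection(q0, w_aa, w_as, w_ss, generations=1000):
    """Iterate selection and return the final S allele frequency."""
    q = q0
    for _ in range(generations):
        q = next_generation_freq(q, w_aa, w_as, w_ss)
    return q


# HYPOTHETICAL relative fitnesses in a malaria area: AS fittest, SS lowest.
W_AA, W_AS, W_SS = 0.9, 1.0, 0.2
q_final = run_selection(0.01, W_AA, W_AS, W_SS)
print(f"S allele settles near {q_final:.3f}")
```

With fitnesses 1 - s, 1 and 1 - t for AA, AS and SS, the S allele settles at s / (s + t), here 0.1 / 0.9, rather than being lost or fixed.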
Altered environments adaptation
Malaria-infested environment
Adaptation: RBC physiology alterations affecting transmission of P. falciparum or P. vivax, giving increased resistance to malaria
Variants: pathogenic mutations in HBB or G6PD for P. falciparum; inactivated DARC variants not expressing Duffy antigen for P. vivax malaria
Lifelong intake of fresh milk
Adaptation: persistence of lactase production in adults allowing efficient digestion of lactose
Variants: 13910T allele about 14 kb upstream of lactase gene LCT
High altitude (low O2 tension)
Adaptation: lowered haemoglobin levels and high density of blood capillaries provide protection against hypoxia
Variants: EPAS1 variants (key gene in hypoxia response)
High dietary starch
Adaptation: increased production of enzyme needed to digest starch efficiently
Variants: high AMY1A copy number
Reduced sunlight (low UV)
Adaptation: decreased pigmentation allowing more efficient transmission of depleted UV to deep layer of dermis to synthesise Vitamin D
Variants: SLC24A5 variant replacing ancestral alanine at position 111 with threonine
Selective sweep
Variant becomes fixed in a population
MHC polymorphism
Pathogen-driven: strong selection pressure due to emergence of mutant pathogens that seek to evade MHC-mediated detection
Gene duplication: multiple MHC genes with different peptide-binding specificities
Many MHC genes extraordinarily polymorphic; the most polymorphic of all protein-coding genes
Most polymorphic loci: HLA-A, HLA-B, HLA-C, HLA-DPB1, HLA-DRB1, HLA-DQB1