Sequencing
DNA sequencing - Reading the order of nucleotides in a DNA molecule
Whole Genome Shotgun (WGS)
Genome is fragmented (from different random starting points) and amplified to produce many fragments.
Read sequences are produced when SEQUENCING machines read the fragments.
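A minimal sketch of the shotgun idea above: reads are taken from random starting points along the genome. The function names, the fixed read length and the toy genome are assumptions made for illustration; real fragments vary in length and carry errors.

```python
import random

def shotgun_reads(genome, read_len, n_reads, seed=0):
    """Sample fixed-length reads from random start positions (idealised, error-free)."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(0, len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads

def mean_depth(genome_len, read_len, n_reads):
    """Average number of reads covering each base (sequencing depth)."""
    return n_reads * read_len / genome_len

genome = "ATGCGTACGTTAGCCGATAGCTAGGCTTACGATCGATCGGA"  # toy genome, not real data
reads = shotgun_reads(genome, read_len=10, n_reads=20)
```

More reads over the same genome give a higher mean depth, which is what later allows overlapping reads to be merged into contigs.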
Long read sequencing
Oxford Nanopore - 40,000++ bp read length
DNA passes through a nanopore in an electrically resistant membrane and disturbs the current flowing through the pore. Each nucleotide causes a characteristic disturbance.
Can sequence both strands back to back as one contiguous read
Lecture 3 slide 23
No amplification needed (no amplification bias); quality does not depend on read length.
Lower per base quality
Long reads can be around 40,000 bases or more
PacBio Single Molecule RealTime (SMRT) sequencing - 40,000 bp read length
A DNA polymerase sits in each 20 zeptolitre reaction chamber/well (Zero Mode Waveguides [ZMWs])
Free phospholinked nucleotides carry fluorophores. As bases are incorporated by the polymerase, the phosphate-linked fluorophore chains are cleaved off and normal DNA is synthesised.
Each base has a colour and duration (indicating methylation, etc.). The colours are read and their order reveals the sequence of the DNA.
Lecture 3 slide 21,22
Can get 1 contig per chromosome
Celera Assembler Nu (CANU)
Short read sequencing
Illumina - 350 bp read length
Nucleotides tagged with fluorescence, DNA polymerase, primers attached to flow cells, adaptors attached to DNA fragments, ssDNA template attached to flow cell
- Break down DNA and attach adaptors to fragments
CLUSTER AMPLIFICATION (NO PCR)
DNA fragments attach to templates through adaptors to initiate cycle. Through cycles of isothermal denaturation, bridge formation and cloning, cluster amplification takes place.
- After amplification, free fluorescently tagged nucleotides flow through the cell and get incorporated. The flow cell's colour pattern is recorded after each cycle, where each colour represents a base.
High throughput, low cost, low error rates and consistency
Phasing - strands within a cluster fall out of sync over the many cycles and repeated incorporation reactions, so variants cannot be called from each cluster with high accuracy.
Crosstalk - Overlap of signals and diffuse outputs
Poor detection of GC-rich DNA, because it can form secondary structures that interfere with the polymerase.
Use of astronomy software to locate random arrays on surface
Image alignment after each cycle
Better image processing to correct for phasing and increase output resolution
Removal of adaptors
HiSeq X Ten
Human genome - 150 bp read length
Bacterial genomes - 350 bp read length
Lecture 4, slide 21,22
High per base quality
Short reads can be around 100-1000 bp (shortest was 18 bp)
Sanger Sequencing - > 500 bp read length
Low throughput - low number of sequences read at a time
Reaction chamber containing ssDNA template, oligonucleotide primers, fluorescently tagged dideoxynucleotides (ddNTPs), regular nucleotides (dNTPs) and DNA polymerase.
Primers attach to the template and free nucleotides are incorporated by the polymerase until a ddNTP is incorporated. The ddNTP TERMINATES synthesis.
The resulting different sized fragments are passed through gel electrophoresis, where fragments are pulled through a gel by the application of electricity. The smaller fragments travel farthest and hence they are separated by size.
As the fragments pass a fluorescence detection unit, the DNA sequence is read from the end base of each fragment, in order of increasing fragment size.
Human Genome Project
lecture 4, slide 17-20
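The chain-termination steps above can be sketched as follows. This is an idealised toy, assuming exactly one terminated fragment per position of the newly synthesised strand and no errors; the function names are hypothetical.

```python
import random

def terminated_fragments(new_strand):
    """One fragment per termination point: each prefix ends in a labelled ddNTP."""
    return [new_strand[:i] for i in range(1, len(new_strand) + 1)]

def read_by_size(fragments):
    """Mimic electrophoresis: order fragments by length, then read each end base."""
    ordered = sorted(fragments, key=len)
    return "".join(fragment[-1] for fragment in ordered)

fragments = terminated_fragments("GATTACA")
random.Random(0).shuffle(fragments)   # fragments start out mixed in the reaction
sequence = read_by_size(fragments)    # size separation restores the order
```

Sorting by length plays the role of the gel, and taking the last base of each fragment plays the role of the fluorescence detector.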
Reads are assembled by overlapping them -
- This process yields contiguous, unbroken consensus sequences, known as contigs, built from many overlapping reads (read depth).
Ideally, we expect 1 contig per chromosome. But due to repeat regions longer than the read length, low sequencing depth and read errors, the assembly can break into multiple contigs for a single chromosome/isolate.
Output of the assembly process is hence a set of contigs.
Assembly is produced by tools such as
SPAdes, Velvet, MegaHit, Skesa, Unicycler
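To illustrate how overlapping reads become contigs, here is a toy greedy overlap merge. It is a swapped-in simplification for illustration only, not how the tools listed above work internally; `overlap`, `greedy_assemble` and `min_len` are hypothetical names.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)

one_contig = greedy_assemble(["ATGCGT", "CGTACG", "ACGTTA"])
two_contigs = greedy_assemble(["ATGCGT", "CGTACG", "GGGTTT"])
```

In the second call one read shares no overlap with the others, so the output is two contigs, mirroring how gaps in coverage break an assembly.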
FASTA format, multi FASTA files
Lecture 3, slide 30, 31
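A minimal sketch of reading a multi FASTA file, assuming the usual layout: a header line starting with ">" followed by one or more sequence lines. The contig names and sequences below are made up.

```python
def parse_fasta(text):
    """Return {header: sequence} for every record in a (multi) FASTA string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()   # header without the leading ">"
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {n: "".join(parts) for n, parts in records.items()}

example = """>contig_1
ATGCGTAC
GTTAGC
>contig_2
GGCTTACG
"""
contigs = parse_fasta(example)
```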
FASTQ format, multi FASTQ files
Lecture 3, slide 33,34
Lecture 4, slide 4
Quality is encoded by letters or symbols for each base
Forward and reverse reads of paired reads in separate files
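A minimal sketch of reading FASTQ records, assuming the common four-line layout (header starting with "@", sequence, a "+" separator line, then one quality symbol per base). The read names and data are invented.

```python
def parse_fastq(text):
    """Return a list of (name, sequence, quality_string) tuples."""
    lines = text.strip().splitlines()
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]  # raises ValueError on a truncated record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        records.append((header[1:], seq, qual))
    return records

example = "@read1/1\nGATTACA\n+\nIIIIH##\n@read2/1\nACGT\n+\n!5?I\n"
reads = parse_fastq(example)
```

For paired reads, the same function would be run once on the forward file and once on the reverse file.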
Phred Quality scores
- measure of confidence that a base was called/assigned correctly during sequencing (base calling)
Lecture 4, slide 9
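A short sketch of how a quality symbol maps to a Phred score and an error probability, using the standard definition Q = -10 * log10(P). The ASCII offset of 33 is an assumption (it is the common encoding, but older files may use a different offset).

```python
import math

def phred_from_char(symbol, offset=33):
    """Decode one FASTQ quality symbol into a Phred score (offset 33 assumed)."""
    return ord(symbol) - offset

def error_probability(q):
    """Probability that the base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def phred_from_probability(p):
    """Inverse direction: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

q = phred_from_char("I")      # Phred 40 under the offset-33 assumption
p = error_probability(q)      # 1 wrong call in 10,000
```

So Q20 means a 1 in 100 chance of a wrong base, and Q30 a 1 in 1000 chance.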
- FastQC - tool that displays data quality in a readset (average across reads for each position)
Drops as we go towards end of read - hence read length matters
When dealing with multiple read sets, MultiQC provides a summary of all FastQC reports
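The per-position average that such a report shows can be sketched like this. It is a rough illustration only, not FastQC's actual implementation; the function name and the offset of 33 are assumptions.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Average Phred score at each read position across all reads."""
    max_len = max(len(q) for q in quality_strings)
    means = []
    for pos in range(max_len):
        # Only reads long enough to have this position contribute.
        scores = [ord(q[pos]) - offset for q in quality_strings if len(q) > pos]
        means.append(sum(scores) / len(scores))
    return means

profile = mean_quality_per_position(["II5", "I?5", "I5"])
```

In this made-up readset the mean falls from 40 to 20 along the read, the same kind of drop towards the read end noted above.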
Sequences unrelated to sample that exist in the sample
- Adaptors - attached to the ends of DNA fragments so they can bind primers and initiate synthesis, e.g. during PCR
- samples can also be spiked with noise to test the sequencing machine
Tool for visualizing assemblies - Bandage
- Reads are converted to overlapping k-mers and these are assembled using the de Bruijn graph method
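A minimal sketch of the k-mer and de Bruijn graph idea: each k-mer becomes an edge from its first k-1 bases to its last k-1 bases. The function names and the simple path walk are assumptions for illustration; real assemblers also handle errors, branches and repeats.

```python
from collections import defaultdict

def kmers(read, k):
    """All overlapping substrings of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unbranched(graph, start):
    """Follow edges while each node has exactly one successor; stop at branches or cycles."""
    seq, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        seq += nxt[-1]
        seen.add(nxt)
        node = nxt
    return seq

graph = de_bruijn(["ATGCG", "GCGTA"], k=3)
contig = walk_unbranched(graph, "AT")
```

The two toy reads share the k-mer GCG, so their paths join in the graph and the walk spells out a single contig.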
Reference Alignment of reads - Read Mapping
Read Alignment - Semi-global
To find position of sequenced reads in a reference - Read mapping
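A minimal sketch of semi-global alignment for read mapping: the whole read must align, but the unaligned reference on either side of it is not penalised. The scoring values and function name are assumptions; real read mappers use indexes and heuristics rather than a full dynamic programming table.

```python
def semiglobal_align(read, ref, match=1, mismatch=-1, gap=-2):
    """Return (score, start, end) so that the read aligns best to ref[start:end]."""
    n, m = len(read), len(ref)
    # Row 0 is all zeros: the read may start anywhere in the reference for free.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    start = [list(range(m + 1)) for _ in range(n + 1)]  # reference start of each path
    for i in range(1, n + 1):
        score[i][0] = i * gap   # read bases with no reference yet must be paid for
        start[i][0] = 0
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            up = score[i - 1][j] + gap     # read base aligned to a gap
            left = score[i][j - 1] + gap   # reference base aligned to a gap
            best = max(diag, up, left)
            score[i][j] = best
            if best == diag:
                start[i][j] = start[i - 1][j - 1]
            elif best == up:
                start[i][j] = start[i - 1][j]
            else:
                start[i][j] = start[i][j - 1]
    # Best cell in the last row: trailing reference after the read is free.
    end = max(range(m + 1), key=lambda j: score[n][j])
    return score[n][end], start[n][end], end

hit = semiglobal_align("ACGT", "TTTACGTTTT")
```

The returned start and end give the read's position in the reference, which is the output read mapping is after.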
De Novo Genome assembly of reads
Required for genome assembly of non-model organisms, novel sequences, novel splice variants (disease causing)