Sequencing

DNA sequencing - Reading the order of nucleotides in a DNA molecule

Whole Genome Shotgun (WGS)

The genome is fragmented at random starting points and amplified to produce many overlapping fragments.

Read sequences are produced when SEQUENCING machines read the fragments.
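A minimal sketch of the shotgun idea, assuming fixed-length, error-free reads sampled uniformly at random from one strand. The names `shotgun_reads` and `mean_depth` are hypothetical helpers, not part of any sequencing tool.

```python
import random

def shotgun_reads(genome, read_length, n_reads, seed=0):
    """Sample fixed-length reads from random start positions of `genome`."""
    last_start = len(genome) - read_length
    if last_start < 0:
        raise ValueError("read_length is longer than the genome")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        reads.append(genome[start:start + read_length])
    return reads

def mean_depth(genome_length, read_length, n_reads):
    """Average number of reads covering each base: n_reads * read_length / genome_length."""
    return n_reads * read_length / genome_length

genome = "ACGTACGTTAGCCGATAGGCTA"
reads = shotgun_reads(genome, read_length=6, n_reads=20)
print(reads[:3], mean_depth(len(genome), 6, 20))
```

Because start positions are random, depth varies along the genome; the formula only gives the average.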

Long read sequencing

Oxford Nanopore - 40,000++ bp read length

DNA passes through a nanopore set in an electrically resistant membrane, disturbing the current flowing through the pore. Each nucleotide causes a characteristic disturbance.

Can sequence both strands back to back as one contiguous read

kmer approach

Lecture 3 slide 23

No amplification needed (so no amplification bias); quality does not depend on read length.

Long reads can be around 40,000 bases or more

PacBio Single Molecule RealTime (SMRT) sequencing - 40,000 bp read length

A DNA polymerase sits in each 20-zeptolitre reaction chamber/well (Zero Mode Waveguides [ZMWs])

Free phospholinked nucleotides carry fluorophores. As the polymerase incorporates each base, the phosphate-linked fluorophore is cleaved off, so normal DNA gets synthesised.

Each base has a colour and duration (indicating methylation, etc.). The colours are read and their order reveals the sequence of the DNA.

Lecture 3 slide 21,22

Can get 1 contig per chromosome

Celera Assembler Nu (CANU)

Short read sequencing

Illumina - 350 bp read length

Nucleotides tagged with fluorescence, DNA polymerase, primers attached to flow cells, adaptors attached to DNA fragments, ssDNA template attached to flow cell

LIBRARY PREPARATION - Break down DNA and attach adaptors to fragments

CLUSTER AMPLIFICATION (NO PCR) -
DNA fragments attach to the flow cell oligos through their adaptors to initiate the cycle. Clusters are amplified through repeated cycles of isothermal denaturation, bridge formation and clonal copying.

SEQUENCING - After amplification, free fluorescently tagged nucleotides flow through the cell and get incorporated. The flow cell's colour pattern is recorded after each cycle, where each colour represents a base.

High throughput, low cost, low error rates and consistency

Errors

Phasing - over many cycles, some strands in a cluster fall behind or run ahead (a base is missed or an extra one is incorporated), so the cluster's signal becomes mixed and bases, and hence variants, can't be called with high accuracy.

Crosstalk - Overlap of signals and diffuse outputs

GC bias - GC-rich DNA is poorly detected because it can form secondary structures that hinder the polymerase.

Error Management

Use of astronomy software to locate random arrays on surface

Image alignment after each cycle

Better image processing to handle phasing and increase output resolution

Removal of adaptors

Types

HiSeq X Ten

Human genome - 150 bp read length

MiSeq

Bacterial genomes - 350 bp read length

Lecture 4, slide 21,22

Short reads can be around 100-1000 bp (shortest was 18 bp)

Sanger Sequencing - > 500 bp read length

Low throughput - low number of sequences read at a time

Reaction chamber contains ssDNA template, oligonucleotide primers, fluorescently labelled dideoxynucleotides (ddNTPs), regular nucleotides (dNTPs) and DNA polymerase.

Primers attach to the template and the polymerase incorporates free nucleotides until a ddNTP is incorporated. The ddNTP TERMINATES synthesis.

The resulting fragments of different sizes are run through gel electrophoresis, where an electric field pulls them through a gel. Smaller fragments travel farthest, so the fragments are separated by size.

As they pass a fluorescence detection unit, the DNA sequence is read from the end base of each fragment, in order of increasing size.
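The read-out step can be sketched as follows, under a simplifying assumption: every chain-terminated product is modelled as a prefix of the newly synthesised strand (primer ignored), and its last letter stands for the labelled ddNTP. The function names are hypothetical.

```python
import random

def terminated_fragments(synthesised_strand):
    """One fragment per position: synthesis stopped after base 1, 2, 3, ..."""
    return [synthesised_strand[:i] for i in range(1, len(synthesised_strand) + 1)]

def read_sanger(fragments):
    """Order fragments by size (as the gel does) and read each terminal base."""
    ordered = sorted(fragments, key=len)
    return "".join(fragment[-1] for fragment in ordered)

fragments = terminated_fragments("GATTACA")
random.shuffle(fragments)       # fragments start as an unordered mixture
print(read_sanger(fragments))   # sorting by length restores the base order
```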

Human Genome Project

Lecture 4, slide 17-20

Lecture 4

Reads are assembled to overlap - ASSEMBLY (2 methods)

Consensus method - Many overlapping reads (read depth) are merged into contiguous, unbroken consensus sequences, known as contigs.

Ideally, we expect 1 contig per chromosome. But when repeat regions are longer than the read length, sequencing depth is low, or reads contain errors, the assembly breaks into multiple contigs for a single chromosome/isolate.

Output of the assembly process is hence a set of contigs.
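A toy illustration of how a consensus turns overlapping reads into contigs, assuming the position (offset) of every read is already known. Real assemblers such as those listed below have to work the overlaps out themselves. `consensus_contigs` is a hypothetical name.

```python
from collections import Counter

def consensus_contigs(placed_reads):
    """placed_reads: list of (offset, read). Returns the list of contigs.

    Each covered position takes the most common base among the reads
    covering it; a position with no coverage breaks the contig.
    """
    columns = {}
    for offset, read in placed_reads:
        for i, base in enumerate(read):
            columns.setdefault(offset + i, Counter())[base] += 1

    contigs, current, previous = [], [], None
    for position in sorted(columns):
        if previous is not None and position != previous + 1:
            contigs.append("".join(current))  # coverage gap: close this contig
            current = []
        current.append(columns[position].most_common(1)[0][0])
        previous = position
    if current:
        contigs.append("".join(current))
    return contigs

reads = [(0, "ACGT"), (2, "GTTA"), (2, "GATA"), (3, "TTAC"), (10, "GGCC")]
print(consensus_contigs(reads))
```

In the example, the read error in "GATA" is outvoted by the other reads at that position, and the uncovered stretch before offset 10 splits the output into two contigs.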

Assembly is produced by tools such as SPAdes, Velvet, MegaHit, Skesa, Unicycler.

FASTA format, multi FASTA files

Lecture 3, slide 30, 31
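A minimal sketch of reading a multi FASTA file: each record starts with a `>` header line, and the sequence may be wrapped over several lines. `parse_fasta` is a hypothetical helper and the contig names are made up.

```python
def parse_fasta(lines):
    """Return {header: sequence} for every record in a (multi) FASTA file."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()       # header text without the '>'
            records[name] = []
        elif name is not None:
            records[name].append(line)    # sequence line of the current record
    return {header: "".join(parts) for header, parts in records.items()}

text = ">contig_1 len=8\nACGT\nACGT\n>contig_2\nTTGA\n"
print(parse_fasta(text.splitlines()))
```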

FASTQ format, multi FASTQ files

Lecture 3, slide 33,34

Lecture 4, slide 4

Quality is encoded by letters or symbols for each base

Forward and reverse reads of paired reads in separate files
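A minimal sketch of reading FASTQ records, assuming the common four-line layout (`@` header, sequence, `+` separator, quality string of the same length as the sequence). Pairing the forward and reverse files by record order is also an assumption, and the read names are made up.

```python
def parse_fastq(lines):
    """Return (read_id, sequence, quality_string) for each four-line record."""
    rows = [line.rstrip("\n") for line in lines]
    rows = [row for row in rows if row]   # ignore blank lines
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ records must have four lines each")
    records = []
    for i in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        records.append((header[1:], sequence, quality))
    return records

forward = parse_fastq(["@read1/1", "ACGT", "+", "IIII", "@read2/1", "GGCA", "+", "II5!"])
reverse = parse_fastq(["@read1/2", "TTGA", "+", "I5II", "@read2/2", "CCAT", "+", "IIII"])
pairs = list(zip(forward, reverse))       # assumes both files list reads in the same order
print(pairs[0])
```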

Quality Control

Phred Quality scores - measure of confidence that a base was called/assigned correctly by the sequencer (Q = -10 log10 P, where P is the probability that the base call is wrong)

Lecture 4, slide 9
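A short sketch of the Phred relationship Q = -10 log10(P), plus decoding of quality characters. The ASCII offset of 33 is an assumption (other offsets exist), and the function names are hypothetical.

```python
import math

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def phred_score(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def decode_quality(quality_string, offset=33):
    """Turn FASTQ quality characters into integer Phred scores."""
    return [ord(char) - offset for char in quality_string]

print(error_probability(20))    # Q20: 1 wrong call in 100
print(error_probability(30))    # Q30: 1 wrong call in 1000
print(decode_quality("I5!"))
```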

FastQC - tool that displays data quality in a readset (average across reads for each position)

Quality drops towards the end of a read - hence read length matters
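The per-position average can be sketched as below. This is only an illustration of the idea, not FastQC's actual implementation, and the scores are invented.

```python
def per_position_mean_quality(read_qualities):
    """Average Phred score at each read position across a readset.

    read_qualities: list of per-read score lists; reads may differ in length,
    so each position is averaged over the reads that reach it.
    """
    totals, counts = [], []
    for scores in read_qualities:
        for position, score in enumerate(scores):
            if position == len(totals):   # first read to reach this position
                totals.append(0)
                counts.append(0)
            totals[position] += score
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]

readset = [[40, 38, 30], [38, 36, 20, 10], [36, 34]]
print(per_position_mean_quality(readset))
```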

Sequences unrelated to the sample that are present in the read data

ADAPTORS - attached to the ends of DNA fragments so that primers can bind and synthesis can start - e.g. during PCR

ADDITIVE DNA - samples can also be spiked with extra DNA to test the sequencing machine

Filter by:

Keeping all the reads and leaving the filtering to downstream software

Reject reads that have low quality

Trim reads from the ends to remove low quality portions
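The reject and trim options can be sketched as below, working on already decoded Phred scores. The thresholds (20) are illustrative assumptions, and both function names are hypothetical.

```python
def trim_low_quality_ends(sequence, scores, min_q=20):
    """Cut bases below min_q from both ends of a read."""
    start, end = 0, len(sequence)
    while end > start and scores[end - 1] < min_q:
        end -= 1
    while start < end and scores[start] < min_q:
        start += 1
    return sequence[start:end], scores[start:end]

def keep_read(scores, min_mean_q=20, min_length=1):
    """Reject reads that are too short or whose mean Phred score is too low."""
    if len(scores) < max(min_length, 1):
        return False
    return sum(scores) / len(scores) >= min_mean_q

sequence, scores = trim_low_quality_ends("ACGTACGT", [10, 30, 35, 38, 36, 30, 15, 8])
print(sequence, scores, keep_read(scores))
```

Only the low-quality ends are removed; a low-quality base in the middle of the read stays and simply lowers the mean.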

By Alignment

Kmer approach - Reads are split into overlapping k-mers, which are assembled using the De Bruijn graph method
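A toy sketch of the k-mer approach: each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. Spelling the sequence here assumes error-free reads and a graph with no branches or repeats, which real data rarely gives. All names are hypothetical.

```python
from collections import defaultdict

def kmers(read, k):
    """All overlapping substrings of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer node to the set of nodes it points to."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def spell_unbranched_path(graph):
    """Walk from the only node without incoming edges while the path is unambiguous."""
    targets = {node for successors in graph.values() for node in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    node = starts[0]
    contig, seen = node, {node}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:          # stop instead of looping around a cycle
            break
        seen.add(node)
        contig += node[-1]        # each step adds one new base
    return contig

reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
print(spell_unbranched_path(de_bruijn_graph(reads, k=4)))
```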

Isolate Assemblies

Reference Alignment of reads - Read Mapping

De Novo Genome assembly of reads

Required for genome assembly of non-model organisms, novel sequences, novel splice variants (disease causing)

Read Alignment - SemiGlobal OR

To find position of sequenced reads in a reference - Read mapping
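A minimal sketch of semi-global alignment for read mapping: the whole read must align, but the unaligned reference before and after it costs nothing. The scoring values are illustrative assumptions, and real read mappers use indexing rather than filling a full matrix against the reference.

```python
def semiglobal_align(read, reference, match=1, mismatch=-1, gap=-2):
    """Return (score, start, end) so that reference[start:end] aligns to the read."""
    m, n = len(read), len(reference)
    # Row 0 stays at zero: the read may start anywhere in the reference for free.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            step = match if read[i - 1] == reference[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,   # (mis)match
                              score[i - 1][j] + gap,        # gap in the reference
                              score[i][j - 1] + gap)        # gap in the read
    # The best cell in the last row: trailing reference is free as well.
    end = max(range(n + 1), key=lambda j: score[m][j])
    # Trace back to row 0 to find where the alignment starts.
    i, j = m, end
    while i > 0:
        if j > 0:
            step = match if read[i - 1] == reference[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                i, j = i - 1, j - 1
                continue
        if score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return score[m][end], j, end

print(semiglobal_align("GATTACA", "TTTGATTACATTT"))
```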

ASSEMBLY - Colours (and hence bases) are read out as reads, which are then assembled to get the final sequence

READ QUALITY of a sequence - Average of Phred scores across all the bases in a read sequence

When dealing with multiple read sets, MultiQC provides a summary of all FastQC reports

Tool for visualizing assemblies - Bandage

Remove multi-mapping reads - very short reads, shorter than the repeating sequences, that map to more than one place - this will improve alignment runtime

High per base quality

Lower per base quality