A Case Study into Microbial Genome Assembly Gap Sequences and Finishing Strategies
Objectives
Further research
finish microbial genome assemblies
gene models
using multiple assembly programs + rRNA analysis
checking plasmid content
Background
Draft genome
misassembled region
incorrect gene calls
information is lacking on the nature of unassembled DNA
Association between gap DNA and raw data
long reads
short reads
Assembly and polishing improvement
base calling
hybrid assembly
Methods
Manual
Draft and hybrid genome assemblies were mapped to the PacBio-only assembly
PCR + Sanger sequencing = validation
Bioinformatic tools
Extending contig ends relative to adjacent contigs (overlap <10 kb and >1% mismatches): Geneious software (8.1.6)
Limitation
No evidence of overlapping contigs :red_cross:
No order and orientation for adjoining contigs :red_cross:
same lineage had low average nucleotide identity (<90%) :red_cross:
Sequencing
Quality-based trimming
Gap analysis
Assemblies
Illumina : SPAdes (3.5.0) / ABySS (1.5.2)
PacBio only : SMRT Analysis software + HGAP / manual finishing
Post-assembly / base-call
PacBio : Quiver
Illumina : Pilon software
Gene calling and genome annotation :
Prodigal algorithm and microbial genome annotation pipeline :question:
determine :
DNA modification by database
non-chromosomal sequence by singleton sequence
PacBio + Illumina : SPAdes hybrid assembler (3.5.0)
Assembly summary statistics : QUAST software version 2.3 :question:
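Assembly summary statistics of the kind QUAST reports can be illustrated with a minimal sketch. The function name `n50` and the example contig lengths are hypothetical and are not taken from the study; the sketch only shows how N50 is derived from a list of contig lengths.

```python
def n50(contig_lengths):
    """Return the N50: the length of the contig at which the running total
    of lengths (sorted longest first) reaches half of the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths, for illustration only.
example_lengths = [100, 200, 300, 400, 500]
summary = {
    "contigs": len(example_lengths),
    "total_length": sum(example_lengths),
    "largest_contig": max(example_lengths),
    "N50": n50(example_lengths),
}
print(summary)
```

A finished genome would ideally show a contig count equal to the number of replicons, so the contig count and N50 together give a quick view of how far a draft is from finished status.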
Illumina : Geneious software + manual inspection
PacBio : manual finishing + mfold web server (DNA folding) + PerPlot and PerScan tools (positional preference) :question:
Illumina : CLC Genomics workbench software
short reads (<20bp)
low quality score (<30) :question:
PacBio : SMRT Analysis software
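The two Illumina trimming thresholds listed above (reads shorter than 20 bp, quality score below 30) can be sketched as a simple filter. This is an illustrative assumption about how such thresholds might be applied, not the algorithm used by CLC Genomics Workbench; the function names and the Phred+33 encoding are assumed.

```python
MIN_LENGTH = 20   # reads shorter than this are discarded (threshold from the outline)
MIN_QUALITY = 30  # Phred score threshold (threshold from the outline)


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - offset for ch in quality_string]


def trim_read(sequence, quality_string):
    """Trim low-quality bases from the 3' end of a read.

    Returns (sequence, quality_string) after trimming, or None when the
    trimmed read is shorter than MIN_LENGTH.
    """
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < MIN_QUALITY:
        end -= 1
    if end < MIN_LENGTH:
        return None
    return sequence[:end], quality_string[:end]


# Hypothetical read: 22 high-quality bases followed by 3 low-quality bases.
kept = trim_read("A" * 25, "I" * 22 + "#" * 3)
dropped = trim_read("A" * 25, "I" * 10 + "#" * 15)
```

Trimming only from the 3' end is a deliberate simplification; production trimmers usually also handle adapters, ambiguous bases, and the 5' end.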
Illumina
limitation : cannot sequence large repetitive region
200x coverage
short reads
paired-end
sequencing by synthesis
PacBio
100x coverage
long reads
single molecule sequencing
limitation : require <10 contigs and > 100x coverage
Raw data
finish
near-finish
systematic evaluation
unassembled DNA region
properties
features
draft
Software
assembly
SPAdes
ABySS
HGAP
SMRT Analysis
base call correction (SNP,indels)
Pilon
Quiver
iCORN
secondary structure
mfold
Hypothesis
PacBio
Strong secondary structure (e.g. hairpin loop, supercoil) : blocks DNA polymerase :green_cross:
GC content :green_cross:
cumulative effect : repetitive sequence and low coverage :check:
flanking gaps : high similarity score with other regions in the same genome -> repetitive DNA :check:
Illumina
repetitive DNA region (e.g. multiple rRNA operon) :check:
low coverage -> cumulative effect :check:
rRNA operon -> breakpoint :check:
high positional preference (>2.5) :check:
transposon :check:
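Two of the properties considered above, GC content and similarity of gap-flanking sequence to other regions of the same genome, can be measured with a short sketch. The function names, the k-mer size of 8, and the example sequences are all hypothetical; the study itself used dedicated tools, and this sketch ignores reverse-complement matches.

```python
def gc_content(sequence):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmers(sequence, k):
    """Set of all overlapping k-mers in a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(flank, rest_of_genome, k=8):
    """Fraction of the flank's k-mers that also occur elsewhere in the genome.

    A value near 1.0 suggests the flank is repeated elsewhere, which is a
    crude stand-in for the similarity score mentioned in the outline.
    """
    flank_kmers = kmers(flank, k)
    if not flank_kmers:
        return 0.0
    return len(flank_kmers & kmers(rest_of_genome, k)) / len(flank_kmers)


# Hypothetical sequences, for illustration only.
flank = "ACGTACGTAC"
repeat_rich = "TTTTACGTACGTACTTTT"
unrelated = "GGGGGGGGGGGG"
print(gc_content(flank))
print(shared_kmer_fraction(flank, repeat_rich))
print(shared_kmer_fraction(flank, unrelated))
```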
Result
AD2
repetitive transposon DNA
super assembly
PCR + Sanger sequencing validation
long overlap (780 kb)
low similarity with other regions in the same genome
DSM 2933
presence of an unknown gap
short overlap (6.7kb)
near-finished status
high similarity score with other regions in the same genome -> repetitive DNA
to obtain high-quality genome assembly
long read
post-assembly polishing steps
gap closure strategies
working processes