LU2: Biological Databases
1. General sequence databases
nucleotide
protein
Primary
nucleotide
sequence databases
GenBank/EMBL/DDBJ =
International Nucleotide Sequence Database Collaboration (INSDC)
EMBL-Bank
(European Molecular Biology Laboratory)
Hosted by:
European Molecular Biology Laboratory (EMBL)
&
European Bioinformatics Institute (EBI)
, England, UK
Search system:
ENA browser
- supported by
Sequence Retrieval System (SRS)
&
dbfetch (Database fetch)
DDBJ
(DNA Data Bank of Japan)
Hosted by:
National Institute of Genetics (NIG)
&
Center for Information Biology (CIB)
Mishima, Japan
Search system:
All-round Retrieval of Sequence and Annotation (ARSA)
GenBank
Hosted by
National Center for Biotechnology Information (NCBI)
, USA
Based on
Entrez
search system
DATABASE FORMATS
(
Flat-file format
)
Based on
ASCII (American Standard Code for Information Interchange)
file structure
Stored data is structured as a
collection
of data entries
Advantages
Universality
- wide compatibility across different types of software and hardware
Understood
by human and computer
Economy of data storage space
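To make the flat-file idea concrete, here is a minimal Python sketch that splits plain ASCII text into a collection of entries. It assumes the GenBank/EMBL convention that each entry ends with a line containing only "//". The function name split_entries and the sample records are made up for illustration and are not real database entries.

```python
def split_entries(text):
    """Split flat-file text into entries, each terminated by a '//' line."""
    entries, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                entries.append(current)
            current = []
        else:
            current.append(line)
    if current:  # tolerate a final entry with no terminator
        entries.append(current)
    return entries


# Hypothetical, heavily abbreviated records (assumption: not real entries).
SAMPLE = """LOCUS       DEMO0001   12 bp   DNA
DEFINITION  Made-up example entry one.
ORIGIN
        1 atgcatgcat gc
//
LOCUS       DEMO0002   8 bp   DNA
DEFINITION  Made-up example entry two.
ORIGIN
        1 ggccttaa
//
"""

entries = split_entries(SAMPLE)
for entry in entries:
    # The second token on the LOCUS line is the entry name.
    print(entry[0].split()[1], "has", len(entry), "lines")
```

Because the file is ordinary text, the same entries can be read by a person in any editor and by a program with nothing more than line splitting.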
Platforms for data submission
Submission Platforms
Webin
: web-based interface for data submission to
EMBL
database at the EBI
Sakura
: web-based interface for data submission to
DDBJ
BankIt
: NCBI-associated site for
GenBank
submission; simple web-based tool for basic submissions
Preparatory/Formatting Platforms
Sequin
: standalone, downloadable program for formatting sequence data to be submitted by e-mail to
NCBI
and
EBI
tbl2asn
: similar to Sequin, and can be used in
combination with Sequin
Primary
protein
databases
GenPept
UniProt
UniProt Consortium -
European Bioinformatics Institute (EBI), Swiss Institute of Bioinformatics (SIB), Protein Information Resource (PIR)
UniProtKB
UniProtKB/Swiss-Prot
(containing
reviewed
,
manually
annotated entries)
a
curated
and annotated protein sequence database
has
high level of annotation and integration
with other databases, and
very low level of redundancy
'non-redundant'
in the sense that all protein products encoded by one gene are represented in a
single record (gene-based)
UniProtKB/TrEMBL
(containing
unreviewed
,
automatically
annotated entries)
auxiliary
database to UniProtKB/Swiss-Prot
contains all the
translations of all coding sequences
in the European Nucleotide Archive (ENA) that are not yet integrated into UniProtKB/Swiss-Prot
once the entries are curated and of acceptable quality, they will be exported to UniProtKB/Swiss-Prot
'non-redundant'
in the sense that all identical, full-length protein sequences are represented in a
single record (sequence based)
;
separate entries
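The two senses of 'non-redundant' described above (gene-based for UniProtKB/Swiss-Prot, sequence-based for UniProtKB/TrEMBL) can be contrasted with a toy Python sketch. The gene names and sequences are invented, and the grouping function is only an illustration of the idea, not how UniProt actually merges records.

```python
from collections import defaultdict

# Invented toy records: (gene, protein sequence). Assumption: not real data.
records = [
    ("geneA", "MKTAYIAK"),
    ("geneA", "MKTAYIAK"),  # identical duplicate
    ("geneA", "MKTAYLAK"),  # different product of the same gene
    ("geneB", "MSDNELQ"),
]


def group_by(recs, key_index):
    """Group records by the chosen field (0 = gene, 1 = sequence)."""
    groups = defaultdict(list)
    for rec in recs:
        groups[rec[key_index]].append(rec)
    return dict(groups)


gene_based = group_by(records, 0)      # one record per gene
sequence_based = group_by(records, 1)  # one record per identical sequence

print(len(gene_based), "gene-based records")
print(len(sequence_based), "sequence-based records")
```

In this toy set, gene-based grouping collapses all three geneA products into one record, while sequence-based grouping merges only the two identical sequences.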
Information contained in protein databases
Primary amino acid
sequences
Secondary
structures
Protein
family domains
Consensus
active sites
etc.
Link to
structural databases
2. Specialized sequence databases
Characteristics
Search for information within a specific subject area
well-defined set of data
A "cleaned" database
information in a specific format
standardized annotation
Curated Nucleotide Sequence Databases
example:
RefSeq
(Reference Sequences) for all "worked" genomes
data type:
genomic DNA, transcript (RNA), protein products for major organisms
source:
GenBank
ID format for RefSeq
(accession numbers follow a "2+6" format: two letters and six digits, separated by an underscore)
Experimentally determined sequences
NT_123456 (
genomic contigs
)
NM_123456 (
mRNAs
)
NP_123456 (
proteins
)
Computationally predicted sequences
XM_123456 (
model mRNAs
)
XP_123456 (
model proteins
)
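The prefix scheme above can be sketched as a small Python lookup. This is a minimal illustration that follows only the "2+6" pattern and the five prefixes listed in this section; real RefSeq accessions may carry more digits or a version suffix, which this sketch would reject. The names REFSEQ_PREFIXES and classify are hypothetical.

```python
import re

# Prefix meanings taken from the list above.
REFSEQ_PREFIXES = {
    "NT": ("genomic contig", "experimentally determined"),
    "NM": ("mRNA", "experimentally determined"),
    "NP": ("protein", "experimentally determined"),
    "XM": ("model mRNA", "computationally predicted"),
    "XP": ("model protein", "computationally predicted"),
}

# "2+6" pattern: two uppercase letters, an underscore, six digits.
PATTERN = re.compile(r"([A-Z]{2})_(\d{6})")


def classify(accession):
    """Return (molecule type, evidence) or None if the ID does not fit."""
    match = PATTERN.fullmatch(accession)
    if match is None or match.group(1) not in REFSEQ_PREFIXES:
        return None
    return REFSEQ_PREFIXES[match.group(1)]


for acc in ["NM_123456", "XP_123456", "AB_123456"]:
    print(acc, "->", classify(acc))
```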
Other specialized databases
For clustered sequences
UniGene
: a collection of
sequences
grouped by gene, together with information on
protein similarities, gene expression, cDNA clone reagents and genomic location
HOGENOM
: collection of
homologous gene families
from
complete genomes
Genome Database
- focus on one organism
Examples:
Colibase
(
E. coli
and related species),
GDB
(human),
FlyBase
(Drosophila),
WormBase
(
C. elegans
),
AtDB
(Arabidopsis),
SGD
(
S. cerevisiae
)
Clinical and Mutation Databases
OMIM
(Online Mendelian Inheritance in Man): database of disease-linked genes and associated phenotypes
HGMD
(Human Gene Mutation Database): database of sequences and phenotypes of disease-causing mutations
Protein domain databases
Pfam
: collection of multiple sequence alignments covering many common protein domains and families
SMART
(Simple Modular Architecture Research Tool): identification and annotation of genetically mobile domains and the analysis of domain architectures
CDD
: combines SMART and Pfam databases
Gene Expression Database
RNA expression
: results of microarray experiments measuring the change in specific mRNA content under certain conditions;
ArrayExpress (EBI)
and
GEO (NCBI)
Proteome databases
: 2D gel electrophoresis images representing the protein content of a cell or tissue under specific conditions;
SWISS-2DPAGE
Metabolic Pathways Databases
BRENDA
: enzyme database; has comprehensive information on enzymes and enzymatic reactions
KEGG Metabolic Pathways
: includes graphical pathway maps for all known metabolic pathways from various organisms
The WIT Metabolic Reconstruction Project
: produces metabolic reconstructions for sequenced, or partially sequenced, genomes
Boehringer Mannheim - Biochemical Pathways
: searchable database of metabolic pathways, enzymes, substrates and products
Bioimage Database
Bisque Imaging Database
(Bio-Image Semantic Query User Environment)
Developed for the exchange and exploration of biological images
The Bisque system supports the workflow from image capture through image analysis and querying
Store, visualize, organize and analyze images in the cloud (digital storage)
3. Database searching
composite databases
search tools
Introduction
Choice of which dbs to search depends on:
purpose
of search/analysis
types
of dbs provided by the sites' search engine
Composite databases = Integrated databases
Different databases (dbs) used to solve different problems
Data integration system:
Entrez
, SRS
, etc
Searching the databases
Query
Searching through keywords
Similarity search
ENTREZ
The individual dbs are
interlinked
; can search all integrated dbs using one entry
The hard link concept
: applied between entries in different dbs; a hard link exists where there is a
logical
connection between entries
Interface through which all of its component dbs can be accessed and traversed -
an integrated information retrieval system
Search options
Method:
relevance pairs model of retrieval
Basic search: search
terms
without specifying
Boolean operators
and/or limits
Using Boolean Operator
Entrez
processes all Boolean operators in a
left-to-right
sequence
Enclosing individual concepts in
parentheses
changes this priority. The terms inside the parentheses are processed
first
as a unit and then incorporated into the overall strategy
Boolean operators
AND, OR, NOT
must be entered in
uppercase
AND
- intersection: records that match both terms
OR
- union: records that match either term
NOT
- exclusion: records that match the first term but not the second
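The left-to-right rule and the effect of parentheses can be illustrated with Python sets. This is a toy model under stated assumptions: the search terms, the record IDs in hits, and the evaluate function are invented for illustration and do not reproduce how Entrez is implemented.

```python
# Invented record IDs matched by each search term (illustration only).
hits = {
    "human": {1, 2, 3, 4},
    "kinase": {3, 4, 5},
    "mouse": {5, 6},
}


def evaluate(tokens):
    """Evaluate a flat query strictly left to right.

    tokens alternates terms and operators, e.g.
    ["human", "OR", "mouse", "AND", "kinase"].
    """
    result = set(hits[tokens[0]])
    for op, term in zip(tokens[1::2], tokens[2::2]):
        if op == "AND":
            result &= hits[term]   # intersection
        elif op == "OR":
            result |= hits[term]   # union
        elif op == "NOT":
            result -= hits[term]   # exclusion
        else:
            raise ValueError(f"operator must be uppercase AND, OR or NOT: {op!r}")
    return result


# Without parentheses: processed as (human OR mouse) AND kinase.
left_to_right = evaluate(["human", "OR", "mouse", "AND", "kinase"])

# With parentheses: human OR (mouse AND kinase), the grouped part first.
grouped = hits["human"] | (hits["mouse"] & hits["kinase"])

print(sorted(left_to_right))
print(sorted(grouped))
```

The two results differ, which is why grouping concepts in parentheses matters when a query mixes operators.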
Databases
Definition: an
organised
collection of data
searchable
(indexed) data management system
updated
periodically (new releases)
structured
data organisation
cross-referenced (
hyperlinks
) & associated tools
Structured data organisation
Annotation
description of the core data
Organized array of information
compartmentalization of information
Core data
main information of the database
Resource for other databases and tools