DBMS
Databases
Basic Introduction
Model
Application Layer
For a user, this application tier presents an abstracted view of the database
The business logic of the application, which says what actions to carry out under what conditions, is embedded in the application server, instead of being distributed across multiple clients
Client Layer
End-users operate on this tier; they are unaware of the existence of the database beyond this layer
At this layer, multiple views of the database can be provided by the application
Data Layer
At this tier, the database resides along with its query processing languages
Data Model
Types
Relational Model
Tables
The relational model describes data at the logical and view levels, abstracting away low-level details of data storage
A relational database consists of a collection of tables, each of which is assigned a unique name
The relational model uses a collection of tables to represent both data and the relationships among those data
There is a close correspondence between the concept of table and the mathematical concept of relation, from which the relational data model takes its name
In mathematical terminology, a tuple is simply a sequence (or list) of values
A relationship between n values is represented mathematically by an n-tuple of values, that is, a tuple with n values, which corresponds to a row in a table
In the relational model, the term relation is used to refer to a table, while the term tuple is used to refer to a row
We use the term relation instance to refer to a specific instance of a relation, that is, containing a specific set of rows
A data model is a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints
Indexing
B+ Tree
Children
Each non-leaf node has between ⌈n/2⌉ and n children, where n is fixed for a particular tree
n is the order, branching factor, or fan-out of the B+ tree
Node
Non-Leaf
For 2 ≤ i ≤ n−1, pointer Pi points to the subtree that contains search-key values less than Ki and greater than or equal to Ki−1; P1 points to values less than K1, and Pn to values greater than or equal to Kn−1
Contains up to n − 1 search-key values K1, K2,…, Kn−1
Contains up to n pointers P1, P2,…, Pn
The search-key values within a node are kept in sorted order; thus, if i < j, then Ki < Kj
Leaf Structure
For i = 1, 2,…, n − 1, pointer Pi points to a file record with search-key value Ki
Pn
Since there is a linear order on the leaves based on the search-key values that they contain, we use Pn to chain together the leaf nodes in search-key order
If Li and Lj are leaf nodes and i < j (that is, Li is to the left of Lj in the tree), then every search-key value vi in Li is less than every search-key value vj in Lj
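A minimal sketch (not from the source) of the node layout and search procedure just described; the class name BPlusNode and the list-based layout are illustrative assumptions:

    import bisect

    class BPlusNode:
        def __init__(self, is_leaf):
            self.is_leaf = is_leaf
            self.keys = []      # up to n-1 search-key values, kept sorted
            self.pointers = []  # up to n pointers: children, or records in a leaf
            self.next = None    # leaf only: Pn chains leaves in search-key order

    def search(node, key):
        # descend from the root: Pi covers values >= K(i-1) and < Ki
        while not node.is_leaf:
            i = bisect.bisect_right(node.keys, key)
            node = node.pointers[i]
        # in a leaf, Pi points to a record with search-key value Ki
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.pointers[i]
        return None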
Order
The order, or branching factor, b of a B+ tree measures the capacity of nodes (i.e., the number of children nodes) for internal nodes in the tree
Duplicates
Approach 1
Modify the tree structure to store each search key at a leaf node as many times as it appears in records, with each copy pointing to one record
This approach can result in duplicate search key values at internal nodes, making the insertion and deletion procedures more complicated and expensive
Approach 2
Store each search-key value only once, and maintain with it a bucket (or list) of pointers to all the records with that search-key value
This approach is more complicated and can result in inefficient access, especially if the number of record pointers for a particular key is very large
Approach 3
Make the search key unique by adding an extra attribute Ap (such as a record identifier or the primary key) to a search key ai that has duplicates
Then the unique composite search key (ai, Ap) is used instead of ai when building the index
Non-Clustering Index
Indices whose search key specifies an order different from the sequential order of the file are called non-clustering indices, or secondary indices
Clustering Index
If the file containing the records is sequentially ordered, a clustering index is an index whose search key also defines the sequential order of the file
Clustering indices are also called primary indices; the term primary index may appear to denote an index on a primary key, but such indices can in fact be built on any search key
The search key of a clustering index is often the primary key, although that is not necessarily so
We assume that all files are ordered sequentially on some search key. Such files, with a clustering index on the search key, are called index-sequential files. They are designed for applications that require both sequential processing of the entire file and random access to individual records
Types
Ordered Indices
Types
Dense index
In a dense index, an index entry appears for every search-key value in the file
In a dense clustering index, the index record contains the search-key value and a pointer to the first data record with that search-key value
The rest of the records with the same search-key value are stored sequentially after the first record, because the index is a clustering one and records are therefore sorted on that search key
In a dense non-clustering index, the index must store a list of pointers to all records with the same search-key value
Sparse index
In a sparse index, an index entry appears for only some of the search-key values
Sparse indices can be used only if the relation is stored in sorted order of the search key; that is, if the index is a clustering index
As is true in dense indices, each index entry contains a search-key value and a pointer to the first data record with that search-key value
To locate a record, we find the index entry with the largest search-key value that is less than or equal to the search-key value for which we are looking
We start at the record pointed to by that index entry and follow the pointers in the file until we find the desired record
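A hedged sketch of this locate procedure; the (key, position) entry layout and the sorted list of (key, data) records are assumptions made for illustration:

    import bisect

    def locate(sparse_index, records, target):
        # sparse_index: sorted (key, position) entries for only some keys
        # records: the file's (key, data) records, sorted on the search key
        keys = [k for k, _ in sparse_index]
        i = bisect.bisect_right(keys, target) - 1  # largest entry key <= target
        if i < 0:
            return None
        pos = sparse_index[i][1]
        while pos < len(records) and records[pos][0] <= target:
            if records[pos][0] == target:
                return records[pos]  # found by following the file sequentially
            pos += 1
        return None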
Multilevel Indices
Indices with two or more levels are called multilevel indices; the outer index is a sparse index on the inner index
Searching for records with a multilevel index requires significantly fewer I/O operations than does searching for records by binary search
Multilevel indices are closely related to tree structures, such as the binary trees used for in-memory indexing
Index Entry
An index entry, or index record, consists of a search-key value and pointers to one or more records with that value as their search-key value
Index Criteria
Insertion time
The time it takes to insert a new data item
This value includes the time it takes to find the correct place to insert the new data item, as well as the time it takes to update the index structure
Deletion time
The time it takes to delete a data item
This value includes the time it takes to find the item to be deleted, as well as the time it takes to update the index structure
Access time
The time it takes to find a particular data item, or set of items, using the technique in question
Space overhead
The additional space occupied by an index structure
Provided that the amount of additional space is moderate, it is usually worthwhile to sacrifice the space to achieve improved performance
Access types
The types of access that are supported efficiently
Access types can include finding records with a specified attribute value and finding records whose attribute values fall in a specified range
There is a trade-off that the system designer must make between access time and space overhead. Although the decision regarding this trade-off depends on the specific application, a good compromise is to have a sparse index with one index entry per block
Stores the values of the search keys in sorted order and associates with each search key the records that contain it
Normalization
Normal Forms
First Normal Form
Solutions
Multiple Columns, one column for each attribute
Functional Dependency
A functional dependency X → Y holds on a relation if, whenever two tuples agree on all the attributes of X, they also agree on all the attributes of Y
Closure Method
The closure of an attribute (or set of attributes) is the set of all attributes that can be determined from it using the functional dependencies
Algorithm
Start with the attribute set itself; whenever the left-hand side of a functional dependency is contained in the current set, add the attributes on its right-hand side; repeat until no new attributes can be added (see the sketch below)
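A small sketch of this closure computation; the relation and the dependencies in the example are invented:

    def closure(attrs, fds):
        # attrs: set of attributes; fds: list of (lhs, rhs) pairs of sets
        result = set(attrs)
        changed = True
        while changed:
            changed = False
            for lhs, rhs in fds:
                # if the LHS is already determined, its RHS is determined too
                if lhs <= result and not rhs <= result:
                    result |= rhs
                    changed = True
        return result

    # R(A, B, C) with A -> B and B -> C
    print(closure({"A"}, [({"A"}, {"B"}), ({"B"}, {"C"})]))  # {'A', 'B', 'C'}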
Database Engine
A database system is partitioned into modules that deal with each of the responsibilities of the overall system
Functional Components
Transaction Management
Transaction
A transaction is a collection of operations that performs a single logical function in a database application
Properties
ACID
Atomicity
Either all operations of the transaction are reflected properly in the database, or none are
Consistency
Execution of a transaction in isolation (i.e., with no other transaction executing concurrently) preserves the consistency of the database
Isolation
Even though multiple transactions may execute concurrently, the system guarantees that, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started or Tj started execution after Ti finished
Thus, each transaction is unaware of other transactions executing concurrently in the system
Durability
After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures
Aborted Transaction
A transaction that does not complete its execution successfully is termed aborted; to ensure atomicity, an aborted transaction must have no effect on the state of the database
Rollback
Once the changes caused by an aborted transaction have been undone, we say that the transaction has been rolled back
Log
To allow modifications to be undone, the database system maintains a log of every modification made by a transaction
We record the identifier of the transaction performing the modification, the identifier of the data item being modified, and both the old value (prior to modification) and the new value (after modification) of the data item
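A minimal sketch of such a log record and of using the old values to roll a transaction back; the names and the dict-as-database are illustrative assumptions:

    from dataclasses import dataclass

    @dataclass
    class LogRecord:
        txn_id: str     # identifier of the transaction performing the modification
        item: str       # identifier of the data item being modified
        old_value: int  # value prior to the modification
        new_value: int  # value after the modification

    def rollback(txn_id, log, db):
        # undo the transaction's changes by restoring old values, newest first
        for rec in reversed(log):
            if rec.txn_id == txn_id:
                db[rec.item] = rec.old_value

    db = {"A": 100, "B": 50}
    log = [LogRecord("T1", "A", 100, 90), LogRecord("T1", "B", 50, 60)]
    rollback("T1", log, db)  # db is back to {'A': 100, 'B': 50}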
Transaction States
Active, the initial state; the transaction stays in this state while it is executing
Partially committed, after the final statement has been executed
Committed, after successful completion
Failed, after the discovery that normal execution can no longer proceed
Aborted
After the transaction has been rolled back and the database has been restored to its state prior to the start of the transaction
Actions after abort
Restart
Restart the transaction, but only if the transaction was aborted as a result of some hardware or software error that was not created through the internal logic of the transaction
Kill
Kill the transaction rather than restarting it
It usually does so because of some internal logical error that can be corrected only by rewriting the application program, or because the input was bad, or because the desired data were not found in the database
Compensating Transaction
Once a transaction has committed, we cannot undo its effects by aborting it
The only way to undo the effects of a committed transaction is to execute a compensating transaction
Storage Structure
Types of storage
Volatile storage
Information residing in volatile storage (such as main memory or cache) does not usually survive system crashes
Access to volatile storage is extremely fast, both because of the speed of the memory access itself and because it is possible to access any data item in volatile storage directly
Non-volatile storage
Information residing in non-volatile storage survives system crashes
Examples
Secondary Storage
Tertiary Storage
Non-volatile storage is slower than volatile storage, particularly for random access
Both secondary and tertiary storage devices, however, are susceptible to failures that may result in loss of information
Stable storage
For a transaction to be durable, its changes need to be written to stable storage
For a transaction to be atomic, log records need to be written to stable storage before any changes are made to the database on disk
Information residing in stable storage is, by definition, never lost
Although stable storage is theoretically impossible to obtain, it can be closely approximated by techniques that make data loss extremely unlikely
To implement stable storage, we replicate the information in several non-volatile storage media (usually disk) with independent failure modes
Storage Manager
The storage manager is the component of a database system that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system
The storage manager is responsible for storing, retrieving, and updating data in the database
Database Design Phases
Conceptual Design Phase
The designer chooses a data model and, by applying the concepts of the chosen data model, translates these requirements into a conceptual schema of the database
Stated in terms of the entity-relationship model, the conceptual schema specifies the entities that are represented in the database, the attributes of the entities, the relationships among the entities, and constraints on the entities and relationships
Functional Requirements
Typically, the conceptual-design phase results in the creation of an entity-relationship diagram that provides a graphic representation of the schema
In a specification of functional requirements, users describe the kinds of operations (or transactions) that will be performed on the data
Logical Design Phase
In the logical-design phase, the designer maps the high-level conceptual schema onto the implementation data model of the database system that will be used
The implementation data model is typically the relational data model, and this step typically consists of mapping the conceptual schema defined using the entity-relationship model into a relation schema
Changes to the logical schema are usually harder to carry out, since they may affect a number of queries and updates scattered across application code
Initial Design Phase
The initial phase of database design is to characterize fully the data needs of the prospective database users
The database designer needs to interact extensively with domain experts and users to carry out this task
While there are techniques for diagrammatically representing user requirements, we restrict ourselves here to textual descriptions of user requirements
Physical Design Phase
The designer uses the resulting system-specific database schema in the subsequent physical-design phase, in which the physical features of the database are specified
The physical schema of a database can be changed relatively easily after an application has been built
File Organization
Magnetic disks as well as SSDs are block structured devices, that is, data are read or written in units of a block
Databases deal with records, which are usually much smaller than a block
Most databases use operating system files as an intermediate layer for storing records, which abstract away some details of the underlying blocks
To ensure efficient access, as well as to support recovery from failures, databases must continue to be aware of blocks
Given a set of records, the next decision lies in how to organize them in the file structure
A database is mapped into a number of different files that are maintained by the underlying operating system
Files are provided as a basic construct in operating systems, so we shall assume the existence of an underlying file system.
Each file is also logically partitioned into fixed-length storage units called blocks, which are the units of both storage allocation and data transfer
A block may contain several records; the exact set of records that a block contains is determined by the form of physical data organization being used
Record Length
Fixed-Length Records
Problems
Unless the block size happens to be a multiple of the record size (which is unlikely), some records will cross block boundaries
It is difficult to delete a record from this structure: the space occupied by the deleted record must be filled with some other record, or we need a way to mark deleted records so that they can be ignored
Solutions
When a record is deleted, we could move the record that comes after it into the space formerly occupied by the deleted record, and so on, until every record following the deleted record has been moved ahead
Moving that many records on every deletion is costly; instead, the freed space can be reused for records inserted later, tracked through a free list
At the beginning of the file, we allocate a certain number of bytes as a file header
The header stores the address of the first record whose contents were deleted (the first available record)
We use this first record to store the address of the second available record, and so on to create a free list
On insertion of a new record, we use the record pointed to by the header.
If no space is available, we add the new record to the end of the file
Insertion and deletion for files of fixed-length records are simple to implement because the space made available by a deleted record is exactly the space needed to insert a record
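An in-memory sketch of the free-list scheme described above; list slots stand in for fixed-length record positions, and a deleted slot stores the index of the next free slot:

    class FixedLengthFile:
        def __init__(self):
            self.slots = []        # one fixed-length record per slot
            self.free_head = None  # header: index of the first available record

        def insert(self, record):
            if self.free_head is not None:
                i = self.free_head
                self.free_head = self.slots[i]  # next record in the free list
                self.slots[i] = record
                return i
            self.slots.append(record)           # no free slot: add at the end
            return len(self.slots) - 1

        def delete(self, i):
            self.slots[i] = self.free_head      # deleted slot points to next free
            self.free_head = i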
Variable Length Records
Reasons
Presence of variable length fields, such as strings
Record types that allow repeating fields, such as arrays or multisets
Storage of multiple record types in a file
Problems
How to represent a single record in such a way that individual attributes can be extracted easily, even if they are of variable length
How to store variable-length records within a block, such that records in a block can be extracted easily
Record Representation
Each variable-length attribute is represented in the initial, fixed-length part of the record by an (offset, length) pair
The values for the variable-length attributes are stored consecutively, after the initial fixed-length part of the record
Slotted Page Structure
There is a header at the beginning of each block, containing:
the number of record entries in the header
the end of free space in the block
an array whose entries contain the location and size of each record
This level of indirection allows records to be moved to prevent fragmentation of space inside a block, while supporting indirect pointers to the record
The actual records are allocated contiguously in the block, starting from the end of the block
The free space in the block is contiguous between the final entry in the header array and the first record
Operations
Insertion
If a record is inserted, space is allocated for it at the end of free space, and an entry containing its size and location is added to the header
Deletion
If a record is deleted, the space that it occupies is freed, and its entry is set to deleted (its size is set to −1, for example)
The records in the block before the deleted record are moved, so that the free space created by the deletion gets occupied, and all free space is again between the final entry in the header array and the first record
Records can be grown or shrunk by similar techniques, as long as there is space in the block
The cost of moving the records is not too high, since the size of a block is limited: typical values are around 4 to 8 kilobytes
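An illustrative slotted-page sketch following the description above; for simplicity the header array is a Python list rather than bytes inside the block, and compaction and free-space checks are omitted:

    class SlottedPage:
        def __init__(self, size=4096):
            self.data = bytearray(size)
            self.entries = []     # header array of (location, size); size -1 = deleted
            self.free_end = size  # records grow from the end of the block backwards

        def insert(self, record: bytes):
            self.free_end -= len(record)
            self.data[self.free_end:self.free_end + len(record)] = record
            self.entries.append((self.free_end, len(record)))
            return len(self.entries) - 1  # the slot number is the indirect pointer

        def delete(self, slot):
            location, _ = self.entries[slot]
            self.entries[slot] = (location, -1)  # mark deleted: size set to -1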
Record Organization
Types
Heap file organization
In a heap file organization, a record may be stored anywhere in the file corresponding to a relation
Once placed in a particular location, the record is not usually moved
When a record is inserted in a file, one option for choosing the location is to always add it at the end of the file
If records get deleted, it makes sense to use the space thus freed up to store new records
It is important for a database system to be able to efficiently find blocks that have free space, without having to sequentially search through all the blocks of the file
Free-space map
A free-space map tracks which blocks in a file have free space to store records
The free-space map is commonly represented by an array containing 1 entry for each block in the relation
Each entry represents a fraction f such that at least a fraction f of the space in the block is free
The array is stored in a file, whose blocks are fetched into memory, as required
Whenever a record is inserted, deleted, or changed in size, if the occupancy fraction changes enough to affect the entry value, the entry is updated in the free-space map (a sketch follows below)
To find a block to store a new record of a given size, the database can scan the free-space map to find a block that has enough free space to store that record
If there is no such block, a new block is allocated for the relation
Sparse
Create a second-level free-space map, which has, say, 1 entry for every 100 entries of the main free-space map
Update
Writing the free-space map to disk every time an entry in the map is updated would be very expensive
The free-space map is written periodically; as a result, the free-space map on disk may be outdated, and when a database starts up it may get outdated data about available free space
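The sketch referenced above: a free-space map with one 3-bit entry per block, where an entry value e means at least e/8 of the block is free; the sizes and the rounding rule are assumptions:

    class FreeSpaceMap:
        def __init__(self, num_blocks, block_size=4096):
            self.block_size = block_size
            self.entries = [7] * num_blocks  # a new block is (almost) entirely free

        def update(self, block, free_bytes):
            # store the free fraction rounded down to eighths, capped at 7
            self.entries[block] = min(7, free_bytes * 8 // self.block_size)

        def find_block(self, record_size):
            need = record_size * 8 // self.block_size + 1  # eighths required
            for block, e in enumerate(self.entries):
                if e >= need:
                    return block
            return None  # caller allocates a new block for the relation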
Partitioning
Many databases allow the records in a relation to be partitioned into smaller relations that are stored separately
Data Dictionary Storage
Relational schemas and other metadata about relations are stored in a structure called the data dictionary or system catalog
Information
Names of the relations
Names of the attributes of each relation
Domains and lengths of attributes
Names of views defined on the database, and definitions of those views
Integrity constraints (e.g., key constraints)
Names of users, the default schemas of the users, and passwords or other information to authenticate users
The storage organization (sequential, hash, or heap) of relations, and the location where each relation is stored
The database may store statistical and descriptive data about the relations and attributes, such as the number of tuples in each relation, or the number of distinct values for each attribute
If relations are stored in operating system files, the dictionary would note the names of the file (or files) containing each relation
If the database stores all relations in a single file, the dictionary may note the blocks containing records of each relation in a data structure such as a linked list
The exact choice of how to represent system metadata by relations must be made by the system designers
Since system metadata are frequently accessed, most databases read it from the database into in-memory data structures that can be accessed very efficiently. This is done as part of the database startup, before the database starts processing any queries
Types
Active Data Dictionary
Every change in database structure (using DDL - Data Definition Language) is automatically reflected in the data dictionary
Most relational databases provide read access to their active data dictionary through a predefined set of read-only tables or views that can be queried for database metadata: lists of tables, columns, relationships, etc.
Different vendors use different names for the data dictionary (system catalog, catalog tables, information schema), but the idea is almost always the same
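As a concrete example of reading an active data dictionary, SQLite exposes its catalog through the sqlite_master table (other systems use information_schema or similar); the instructor table here is invented:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE instructor (id INTEGER PRIMARY KEY, name TEXT)")

    # the DDL statement above is automatically reflected in the catalog
    for name, typ, sql in conn.execute("SELECT name, type, sql FROM sqlite_master"):
        print(name, typ, sql)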
Passive Data Dictionary
Changes in database structure need to be applied to a passive data dictionary manually or with dedicated software
Queries are formulae, which define sets
A tuple is in the defined relation if and only if, when substituted for the free variable, it satisfies (makes true) the formula
Result
The attributes of Result are either defined by name in f or inherited from the base relation R via a predicate T ∈ R
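A small example, assuming a Sailors relation with a rating attribute: the query { S | S ∈ Sailors ∧ S.rating > 7 } defines the set of all Sailors tuples whose rating exceeds 7; substituting any such tuple for the free variable S makes the formula true.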
Query
A tuple relational calculus query has the form {T | p(T)}, where T is a tuple variable and p is a formula that may use T as a free variable
Types
The relational algebra and the tuple relational calculus over safe queries are equivalent in expressiveness
Consider the query {S | ¬(S ∈ Sailors)}; it is syntactically correct
However, it asks for all tuples S such that S is not in (the given instance of) Sailors.
The set of such S tuples is obviously infinite, in the context of infinite domains such as the set of all integers
Safe Formula
Let Dom(Q, I) be the set of all constants that appear in the query Q or in the given instance I
For any given I, the set of answers for Q contains only values that are in Dom(Q, I)
For each subexpression of the form ∃R(p(R)) in Q, if a tuple r (assigned to variable R) makes the formula true, then r contains only constants in Dom(Q,I)
For each subexpression of the form ∀R(p(R)) in Q, if a tuple r (assigned to variable R) contains a constant that is not in Dom(Q, I), then r must make the formula true
Relational Algebra
The relational algebra consists of a set of operations that take one or two relations as input and produce a new relation as their result
Operations
Unary
Select: the select operation returns the tuples of its argument relation that satisfy a given predicate
Project
The project operation is a unary operation that returns its argument relation, with certain attributes left out
Rename
The rename operation, denoted ρ, gives a name to the result of a relational-algebra expression
Uses
We may want to save the result of a relational algebra expression as a relation so that we can use it later
We may want to join a relation with itself; in that case it becomes confusing which copy of the relation we are referring to, so we rename one of the copies and perform the join on the renamed relations
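A small illustration with a hypothetical relation employee(name, salary): to pair each employee with every lower-paid one, rename the two copies before taking the product, as in σ_{e.salary > f.salary}(ρ_e(employee) × ρ_f(employee)); without the renames, a reference to salary would be ambiguous.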
Binary
Join
Types
Inner Join joins two tables on the basis of the column(s) explicitly specified in the ON clause
The resulting table contains all the attributes from both tables, including the common column(s)
The inner join condition can use equality (=) as well as other operators (such as <, >, <>)
Natural Join joins two tables on all columns having the same name
The resulting table contains all the attributes of both tables, but keeps only one copy of each common column
A condition join is like an equi-join, except the condition being tested doesn't have to be an equality (although it can be). It can be any well-formed predicate.
An equi-join is a join whose condition uses only equality comparisons
An equi-join can be an inner join, a left outer join, or a right outer join
A self JOIN is a regular join, but the table is joined with itself.
Left ⟕
Left Outer Join retrieves all rows from both tables that satisfy the join condition, along with the unmatched rows of the left table
Right ⟖
Right Outer Join retrieves all rows from both tables that satisfy the join condition, along with the unmatched rows of the right table
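A hedged sketch of inner and left outer join over lists of dicts; the relations emp and dept and their attributes are invented for illustration:

    def inner_join(r, s, on):
        # keep only pairs of rows that match on the join column
        return [{**a, **b} for a in r for b in s if a[on] == b[on]]

    def left_outer_join(r, s, on):
        result = []
        right_attrs = {k for b in s for k in b if k != on}
        for a in r:
            matches = [b for b in s if b[on] == a[on]]
            if matches:
                result.extend({**a, **b} for b in matches)
            else:
                # unmatched left row is kept; right attributes padded with None
                result.append({**a, **{k: None for k in right_attrs}})
        return result

    emp = [{"dept": 1, "name": "Ann"}, {"dept": 3, "name": "Bob"}]
    dept = [{"dept": 1, "dname": "CS"}]
    print(inner_join(emp, dept, "dept"))       # Ann matched with CS
    print(left_outer_join(emp, dept, "dept"))  # Bob kept, dname is None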
Division (r ÷ s)
Tuple t is in r ÷ s if, for every tuple ts in s, there is a tuple in r formed by combining t with ts (i.e., the tuple ⟨t, ts⟩ appears in r)
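A tiny sketch of division for r(A, B) ÷ s(B), with r as a set of (a, b) pairs and s as a set of b values; the enrolment example is invented:

    def divide(r, s):
        # a is in the result iff (a, b) is in r for every b in s
        return {a for a, _ in r if all((a, b) in r for b in s)}

    r = {("ann", "db"), ("ann", "os"), ("bob", "db")}  # (student, course) pairs
    s = {"db", "os"}                                   # required courses
    print(divide(r, s))  # {'ann'}: Ann takes every required course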