DBMS
Databases
Basic Introduction
Model
Application Layer
For a user, this application tier presents an abstracted view of the database
The business logic of the application, which says what actions to carry out under what conditions, is embedded in the application server, instead of being distributed across multiple clients
Client Layer
End-users operate on this tier; they are unaware of the existence of the database beyond this layer
At this layer, multiple views of the database can be provided by the application
Data Layer
At this tier, the database resides along with its query processing languages
Data Model
Types
Relational Model
Tables
The relational model describes data at the logical and view levels, abstracting away low-level details of data storage
A relational database consists of a collection of tables, each of which is assigned a unique name
The relational model uses a collection of tables to represent both data and the relationships among those data
There is a close correspondence between the concept of table and the mathematical concept of relation, from which the relational data model takes its name
In mathematical terminology, a tuple is simply a sequence (or list) of values
A relationship between n values is represented mathematically by an n-tuple of values, that is, a tuple with n values, which corresponds to a row in a table
In the relational model, the term relation is used to refer to a table, while the term tuple is used to refer to a row
We use the term relation instance to refer to a specific instance of a relation, that is, containing a specific set of rows
A data model is a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints
Indexing
B+ Tree
Children
Each non-leaf node has between ⌈n/2⌉ and n children, where n is fixed for a particular tree
n is the order, branching factor, or fan-out of the B+ tree
Node
Non-Leaf
For 2 ≤ i ≤ n−1, pointer Pi points to the subtree that contains search-key values less than Ki and greater than or equal to Ki−1; P1 points to values less than K1, and Pn to values greater than or equal to Kn−1
Contains up to n − 1 search-key values K1, K2,…, Kn−1
Contains up to n pointers P1, P2,…, Pn
The search-key values within a node are kept in sorted order; thus, if i < j, then Ki < Kj
Leaf Structure
For i = 1, 2,…, n − 1, pointer Pi points to a file record with search-key value Ki
Pn
Since there is a linear order on the leaves based on the search-key values that they contain, we use Pn to chain together the leaf nodes in search-key order
If Li and Lj are leaf nodes and i < j (that is, Li is to the left of Lj in the tree), then every search-key value vi in Li is less than every search-key value vj in Lj
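A minimal sketch (not from the source) of the node layout and search procedure just described; the class name BPlusNode and the list-based layout are illustrative assumptions:

    import bisect

    class BPlusNode:
        def __init__(self, is_leaf):
            self.is_leaf = is_leaf
            self.keys = []      # up to n-1 search-key values, kept sorted
            self.pointers = []  # up to n pointers: children, or records in a leaf
            self.next = None    # leaf only: Pn chains leaves in search-key order

    def search(node, key):
        # descend from the root: Pi covers values >= K(i-1) and < Ki
        while not node.is_leaf:
            i = bisect.bisect_right(node.keys, key)
            node = node.pointers[i]
        # in a leaf, Pi points to a record with search-key value Ki
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.pointers[i]
        return None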
Order
The order, or branching factor, b of a B+ tree measures the capacity of nodes (i.e., the number of children nodes) for internal nodes in the tree
Duplicates
Approach 1
Modify the tree structure to store each search key at a leaf node as many times as it appears in records, with each copy pointing to one record
This approach can result in duplicate search key values at internal nodes, making the insertion and deletion procedures more complicated and expensive
Approach 2
Store each search-key value only once, and maintain with it a bucket (or list) of pointers to all the records with that search-key value
This approach is more complicated and can result in inefficient access, especially if the number of record pointers for a particular key is very large
Approach 3
Make the search key unique by adding an extra attribute Ap (such as a record identifier or the primary key) to a search key ai that has duplicates
Then the unique composite search key (ai, Ap) is used instead of ai when building the index
Non-Clustering Index
Indices whose search key specifies an order different from the sequential order of the file are called non-clustering indices, or secondary indices
Clustering Index
If the file containing the records is sequentially ordered, a clustering index is an index whose search key also defines the sequential order of the file
Clustering indices are also called primary indices; the term primary index may appear to denote an index on a primary key, but such indices can in fact be built on any search key
The search key of a clustering index is often the primary key, although that is not necessarily so
We assume that all files are ordered sequentially on some search key. Such files, with a clustering index on the search key, are called index-sequential files. They are designed for applications that require both sequential processing of the entire file and random access to individual records
Types
Ordered Indices
Types
Dense index
In a dense index, an index entry appears for every search-key value in the file
In a dense clustering index, the index record contains the search-key value and a pointer to the first data record with that search-key value
The rest of the records with the same search-key value are stored sequentially after the first record, because the index is a clustering one and records are therefore sorted on that search key
In a dense non-clustering index, the index must store a list of pointers to all records with the same search-key value
Sparse index
In a sparse index, an index entry appears for only some of the search-key values
Sparse indices can be used only if the relation is stored in sorted order of the search key; that is, if the index is a clustering index
As is true in dense indices, each index entry contains a search-key value and a pointer to the first data record with that search-key value
To locate a record, we find the index entry with the largest search-key value that is less than or equal to the search-key value for which we are looking
We start at the record pointed to by that index entry and follow the pointers in the file until we find the desired record
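A hedged sketch of this locate procedure; the (key, position) entry layout and the sorted list of (key, data) records are assumptions made for illustration:

    import bisect

    def locate(sparse_index, records, target):
        # sparse_index: sorted (key, position) entries for only some keys
        # records: the file's (key, data) records, sorted on the search key
        keys = [k for k, _ in sparse_index]
        i = bisect.bisect_right(keys, target) - 1  # largest entry key <= target
        if i < 0:
            return None
        pos = sparse_index[i][1]
        while pos < len(records) and records[pos][0] <= target:
            if records[pos][0] == target:
                return records[pos]  # found by following the file sequentially
            pos += 1
        return None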
Multilevel Indices
Indices with two or more levels are called multilevel indices; the outer index is a sparse index on the inner index
Searching for records with a multilevel index requires significantly fewer I/O operations than does searching for records by binary search
Multilevel indices are closely related to tree structures, such as the binary trees used for in-memory indexing
Index Entry
An index entry, or index record, consists of a search-key value and pointers to one or more records with that value as their search-key value
Index Criteria
Insertion time
The time it takes to insert a new data item
This value includes the time it takes to find the correct place to insert the new data item, as well as the time it takes to update the index structure
Deletion time
The time it takes to delete a data item
This value includes the time it takes to find the item to be deleted, as well as the time it takes to update the index structure
Access time
The time it takes to find a particular data item, or set of items, using the technique in question
Space overhead
The additional space occupied by an index structure
Provided that the amount of additional space is moderate, it is usually worthwhile to sacrifice the space to achieve improved performance
Access types
The types of access that are supported efficiently
Access types can include finding records with a specified attribute value and finding records whose attribute values fall in a specified range
There is a trade-off that the system designer must make between access time and space overhead. Although the decision regarding this trade-off depends on the specific application, a good compromise is to have a sparse index with one index entry per block
Stores the values of the search keys in sorted order and associates with each search key the records that contain it
Normalization
Normal Forms
First Normal Form
Solutions
Multiple Columns, one column for each attribute
Functional Dependency
A functional dependency X → Y holds on a relation if, whenever two tuples agree on all the attributes of X, they also agree on all the attributes of Y
Closure Method
The closure of an attribute (or set of attributes) is the set of all attributes that can be determined from it using the functional dependencies
Algorithm
Start with the attribute set itself; whenever the left-hand side of a functional dependency is contained in the current set, add the attributes on its right-hand side; repeat until no new attributes can be added (see the sketch below)
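A small sketch of this closure computation; the relation and the dependencies in the example are invented:

    def closure(attrs, fds):
        # attrs: set of attributes; fds: list of (lhs, rhs) pairs of sets
        result = set(attrs)
        changed = True
        while changed:
            changed = False
            for lhs, rhs in fds:
                # if the LHS is already determined, its RHS is determined too
                if lhs <= result and not rhs <= result:
                    result |= rhs
                    changed = True
        return result

    # R(A, B, C) with A -> B and B -> C
    print(closure({"A"}, [({"A"}, {"B"}), ({"B"}, {"C"})]))  # {'A', 'B', 'C'}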
Database Engine
A database system is partitioned into modules that deal with each of the responsibilities of the overall system
Functional Components
Transaction Management
Transaction
A transaction is a collection of operations that performs a single logical function in a database application
Properties
ACID
Atomicity
Either all operations of the transaction are reflected properly in the database, or none are
Consistency
Execution of a transaction in isolation (i.e., with no other transaction executing concurrently) preserves the consistency of the database
Isolation
Even though multiple transactions may execute concurrently, the system guarantees that, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started or Tj started execution after Ti finished
Thus, each transaction is unaware of other transactions executing concurrently in the system
Durability
After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures
Aborted Transaction
A transaction that does not complete its execution successfully is termed aborted; to ensure atomicity, an aborted transaction must have no effect on the state of the database
Rollback
Once the changes caused by an aborted transaction have been undone, we say that the transaction has been rolled back
Log
To allow modifications to be undone, the database system maintains a log of every modification made by a transaction
We record the identifier of the transaction performing the modification, the identifier of the data item being modified, and both the old value (prior to modification) and the new value (after modification) of the data item
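A minimal sketch of such a log record and of using the old values to roll a transaction back; the names and the dict-as-database are illustrative assumptions:

    from dataclasses import dataclass

    @dataclass
    class LogRecord:
        txn_id: str     # identifier of the transaction performing the modification
        item: str       # identifier of the data item being modified
        old_value: int  # value prior to the modification
        new_value: int  # value after the modification

    def rollback(txn_id, log, db):
        # undo the transaction's changes by restoring old values, newest first
        for rec in reversed(log):
            if rec.txn_id == txn_id:
                db[rec.item] = rec.old_value

    db = {"A": 100, "B": 50}
    log = [LogRecord("T1", "A", 100, 90), LogRecord("T1", "B", 50, 60)]
    rollback("T1", log, db)  # db is back to {'A': 100, 'B': 50}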
Transaction States
Active, the initial state; the transaction stays in this state while it is executing
Partially committed, after the final statement has been executed
Committed, after successful completion
Failed, after the discovery that normal execution can no longer proceed
Aborted
After the transaction has been rolled back and the database has been restored to its state prior to the start of the transaction
Actions after abort
Restart
Restart the transaction, but only if the transaction was aborted as a result of some hardware or software error that was not created through the internal logic of the transaction
Kill
Kill the transaction rather than restarting it
It usually does so because of some internal logical error that can be corrected only by rewriting the application program, or because the input was bad, or because the desired data were not found in the database
Compensating Transaction
Once a transaction has committed, we cannot undo its effects by aborting it
The only way to undo the effects of a committed transaction is to execute a compensating transaction
Storage Structure
Types of storage
Volatile storage
Information residing in volatile storage (such as main memory or cache) does not usually survive system crashes
Access to volatile storage is extremely fast, both because of the speed of the memory access itself and because it is possible to access any data item in volatile storage directly
Non-volatile storage
Information residing in non-volatile storage survives system crashes
Examples
Secondary Storage
Tertiary Storage
Non-volatile storage is slower than volatile storage, particularly for random access
Both secondary and tertiary storage devices, however, are susceptible to failures that may result in loss of information
Stable storage
For a transaction to be durable, its changes need to be written to stable storage
For a transaction to be atomic, log records need to be written to stable storage before any changes are made to the database on disk
Information residing in stable storage is, by definition, never lost
Although stable storage is theoretically impossible to obtain, it can be closely approximated by techniques that make data loss extremely unlikely
To implement stable storage, we replicate the information in several non-volatile storage media (usually disk) with independent failure modes
Storage Manager
The storage manager is the component of a database system that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system
The storage manager is responsible for storing, retrieving, and updating data in the database
Database Design Phases
Conceptual Design Phase
The designer chooses a data model and, by applying the concepts of the chosen data model, translates these requirements into a conceptual schema of the database
Stated in terms of the entity-relationship model, the conceptual schema specifies the entities that are represented in the database, the attributes of the entities, the relationships among the entities, and constraints on the entities and relationships
Functional Requirements
Typically, the conceptual-design phase results in the creation of an entity-relationship diagram that provides a graphic representation of the schema
In a specification of functional requirements, users describe the kinds of operations (or transactions) that will be performed on the data
Logical Design Phase
In the logical-design phase, the designer maps the high-level conceptual schema onto the implementation data model of the database system that will be used
The implementation data model is typically the relational data model, and this step typically consists of mapping the conceptual schema defined using the entity-relationship model into a relation schema
Changes to the logical schema are usually harder to carry out, since they may affect a number of queries and updates scattered across application code
Initial Design Phase
The initial phase of database design is to characterize fully the data needs of the prospective database users
The database designer needs to interact extensively with domain experts and users to carry out this task
While there are techniques for diagrammatically representing user requirements, we restrict ourselves here to textual descriptions of user requirements
Physical Design Phase
The designer uses the resulting system-specific database schema in the subsequent physical-design phase, in which the physical features of the database are specified
The physical schema of a database can be changed relatively easily after an application has been built
File Organization
Magnetic disks as well as SSDs are block structured devices, that is, data are read or written in units of a block
Databases deal with records, which are usually much smaller than a block
Most databases use operating system files as an intermediate layer for storing records, which abstract away some details of the underlying blocks
To ensure efficient access, as well as to support recovery from failures, databases must continue to be aware of blocks
Given a set of records, the next decision lies in how to organize them in the file structure
A database is mapped into a number of different files that are maintained by the underlying operating system
Files are provided as a basic construct in operating systems, so we shall assume the existence of an underlying file system.
Each file is also logically partitioned into fixed-length storage units called blocks, which are the units of both storage allocation and data transfer
A block may contain several records; the exact set of records that a block contains is determined by the form of physical data organization being used
Record Length
Fixed-Length Records
Problems
Unless the block size happens to be a multiple of the record size (which is unlikely), some records will cross block boundaries
It is difficult to delete a record from this structure: the space occupied by the deleted record must be filled with some other record, or we need a way to mark deleted records so that they can be ignored
Solutions
When a record is deleted, we could move the record that comes after it into the space formerly occupied by the deleted record, and so on, until every record following the deleted record has been moved ahead
Moving that many records on every deletion is costly; instead, the freed space can be reused for records inserted later, tracked through a free list
At the beginning of the file, we allocate a certain number of bytes as a file header
The header stores the address of the first record whose contents were deleted (the first available record)
We use this first record to store the address of the second available record, and so on to create a free list
On insertion of a new record, we use the record pointed to by the header.
If no space is available, we add the new record to the end of the file
Insertion and deletion for files of fixed-length records are simple to implement because the space made available by a deleted record is exactly the space needed to insert a record
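An in-memory sketch of the free-list scheme described above; list slots stand in for fixed-length record positions, and a deleted slot stores the index of the next free slot:

    class FixedLengthFile:
        def __init__(self):
            self.slots = []        # one fixed-length record per slot
            self.free_head = None  # header: index of the first available record

        def insert(self, record):
            if self.free_head is not None:
                i = self.free_head
                self.free_head = self.slots[i]  # next record in the free list
                self.slots[i] = record
                return i
            self.slots.append(record)           # no free slot: add at the end
            return len(self.slots) - 1

        def delete(self, i):
            self.slots[i] = self.free_head      # deleted slot points to next free
            self.free_head = i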
Variable Length Records
Reasons
Presence of variable length fields, such as strings
Record types that allow repeating fields, such as arrays or multisets
Storage of multiple record types in a file
Problems
How to represent a single record in such a way that individual attributes can be extracted easily, even if they are of variable length
How to store variable-length records within a block, such that records in a block can be extracted easily
Record Representation
Each variable-length attribute is represented in the initial, fixed-length part of the record by an (offset, length) pair
The values for the variable-length attributes are stored consecutively, after the initial fixed-length part of the record
Slotted Page Structure
There is a header at the beginning of each block, containing:
the number of record entries in the header
the end of free space in the block
an array whose entries contain the location and size of each record
This level of indirection allows records to be moved to prevent fragmentation of space inside a block, while supporting indirect pointers to the record
The actual records are allocated contiguously in the block, starting from the end of the block
The free space in the block is contiguous between the final entry in the header array and the first record
Operations
Insertion
If a record is inserted, space is allocated for it at the end of free space, and an entry containing its size and location is added to the header
Deletion
If a record is deleted, the space that it occupies is freed, and its entry is set to deleted (its size is set to −1, for example)
The records in the block before the deleted record are moved, so that the free space created by the deletion gets occupied, and all free space is again between the final entry in the header array and the first record
Records can be grown or shrunk by similar techniques, as long as there is space in the block
The cost of moving the records is not too high, since the size of a block is limited: typical values are around 4 to 8 kilobytes
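An illustrative slotted-page sketch following the description above; for simplicity the header array is a Python list rather than bytes inside the block, and compaction and free-space checks are omitted:

    class SlottedPage:
        def __init__(self, size=4096):
            self.data = bytearray(size)
            self.entries = []     # header array of (location, size); size -1 = deleted
            self.free_end = size  # records grow from the end of the block backwards

        def insert(self, record: bytes):
            self.free_end -= len(record)
            self.data[self.free_end:self.free_end + len(record)] = record
            self.entries.append((self.free_end, len(record)))
            return len(self.entries) - 1  # the slot number is the indirect pointer

        def delete(self, slot):
            location, _ = self.entries[slot]
            self.entries[slot] = (location, -1)  # mark deleted: size set to -1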
Record Organization
Types
Heap file organization
In a heap file organization, a record may be stored anywhere in the file corresponding to a relation
Once placed in a particular location, the record is not usually moved
When a record is inserted in a file, one option for choosing the location is to always add it at the end of the file
If records get deleted, it makes sense to use the space thus freed up to store new records
It is important for a database system to be able to efficiently find blocks that have free space, without having to sequentially search through all the blocks of the file
Free-space map
A free-space map tracks which blocks in a file have free space to store records
The free-space map is commonly represented by an array containing 1 entry for each block in the relation
Each entry represents a fraction f such that at least a fraction f of the space in the block is free
The array is stored in a file, whose blocks are fetched into memory, as required
Whenever a record is inserted, deleted, or changed in size, if the occupancy fraction changes enough to affect the entry value, the entry is updated in the free-space map (a sketch follows below)
To find a block to store a new record of a given size, the database can scan the free-space map to find a block that has enough free space to store that record
If there is no such block, a new block is allocated for the relation
Sparse
Create a second-level free-space map, which has, say, 1 entry for every 100 entries of the main free-space map
Update
Writing the free-space map to disk every time an entry in the map is updated would be very expensive
The free-space map is written periodically; as a result, the free-space map on disk may be outdated, and when a database starts up it may get outdated data about available free space
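The sketch referenced above: a free-space map with one 3-bit entry per block, where an entry value e means at least e/8 of the block is free; the sizes and the rounding rule are assumptions:

    class FreeSpaceMap:
        def __init__(self, num_blocks, block_size=4096):
            self.block_size = block_size
            self.entries = [7] * num_blocks  # a new block is (almost) entirely free

        def update(self, block, free_bytes):
            # store the free fraction rounded down to eighths, capped at 7
            self.entries[block] = min(7, free_bytes * 8 // self.block_size)

        def find_block(self, record_size):
            need = record_size * 8 // self.block_size + 1  # eighths required
            for block, e in enumerate(self.entries):
                if e >= need:
                    return block
            return None  # caller allocates a new block for the relation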
Partitioning
Many databases allow the records in a relation to be partitioned into smaller relations that are stored separately
Data Dictionary Storage
Relational schemas and other metadata about relations are stored in a structure called the data dictionary or system catalog
Information
Names of the relations
Names of the attributes of each relation
Domains and lengths of attributes
Names of views defined on the database, and definitions of those views
Integrity constraints (e.g., key constraints)
Names of users, the default schemas of the users, and passwords or other information to authenticate users
The storage organization (sequential, hash, or heap) of relations, and the location where each relation is stored
The database may store statistical and descriptive data about the relations and attributes, such as the number of tuples in each relation, or the number of distinct values for each attribute
If relations are stored in operating system files, the dictionary would note the names of the file (or files) containing each relation
If the database stores all relations in a single file, the dictionary may note the blocks containing records of each relation in a data structure such as a linked list
The exact choice of how to represent system metadata by relations must be made by the system designers
Since system metadata are frequently accessed, most databases read it from the database into in-memory data structures that can be accessed very efficiently. This is done as part of the database startup, before the database starts processing any queries
Types
Active Data Dictionary
Every change in database structure (using DDL - Data Definition Language) is automatically reflected in the data dictionary
Most relational databases provide read access to their active data dictionary through a predefined set of read-only tables or views that can be queried for database metadata: lists of tables, columns, relationships, etc.
Different vendors use different names for the data dictionary (system catalog, catalog tables, information schema), but the idea is almost always the same
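As a concrete example of reading an active data dictionary, SQLite exposes its catalog through the sqlite_master table (other systems use information_schema or similar); the instructor table here is invented:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE instructor (id INTEGER PRIMARY KEY, name TEXT)")

    # the DDL statement above is automatically reflected in the catalog
    for name, typ, sql in conn.execute("SELECT name, type, sql FROM sqlite_master"):
        print(name, typ, sql)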
Passive Data Dictionary
Changes in database structure need to be applied to a passive data dictionary manually or with dedicated software
Queries are formulae, which define sets
A tuple is in the defined relation if and only if, when substituted for the free variable, it satisfies (makes true) the formula
Result
The attributes of Result are either defined by name in f or inherited from the base relation R via a predicate T ∈ R
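A small example, assuming a Sailors relation with a rating attribute: the query { S | S ∈ Sailors ∧ S.rating > 7 } defines the set of all Sailors tuples whose rating exceeds 7; substituting any such tuple for the free variable S makes the formula true.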
Query
A tuple relational calculus query has the form {T | p(T)}, where T is a tuple variable and p is a formula that may use T as a free variable
Types
The relational algebra and the tuple relational calculus over safe queries are equivalent in expressiveness
Consider the query {S | ¬(S ∈ Sailors)}; it is syntactically correct
However, it asks for all tuples S such that S is not in (the given instance of) Sailors.
The set of such S tuples is obviously infinite, in the context of infinite domains such as the set of all integers
Safe Formula
Let Dom(Q, I) be the set of all constants that appear in the query Q or in the given instance I
For any given I, the set of answers for Q contains only values that are in Dom(Q, I)
For each subexpression of the form ∃R(p(R)) in Q, if a tuple r (assigned to variable R) makes the formula true, then r contains only constants in Dom(Q,I)
For each subexpression of the form ∀R(p(R)) in Q, if a tuple r (assigned to variable R) contains a constant that is not in Dom(Q, I), then r must make the formula true
Relational Algebra
The relational algebra consists of a set of operations that take one or two relations as input and produce a new relation as their result
Operations
Unary
Select: the select operation returns the tuples of its argument relation that satisfy a given predicate
Project
The project operation is a unary operation that returns its argument relation, with certain attributes left out
Rename
The rename operation, denoted ρ, gives a name to the result of a relational-algebra expression
Uses
We may want to save the result of a relational algebra expression as a relation so that we can use it later
We may want to join a relation with itself; in that case it becomes confusing which copy of the relation we are referring to, so we rename one of the copies and perform the join on the renamed relations
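A small illustration with a hypothetical relation employee(name, salary): to pair each employee with every lower-paid one, rename the two copies before taking the product, as in σ_{e.salary > f.salary}(ρ_e(employee) × ρ_f(employee)); without the renames, a reference to salary would be ambiguous.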
Binary
Join
Types
Inner Join joins two tables on the basis of the column(s) explicitly specified in the ON clause
The resulting table contains all the attributes from both tables, including the common column(s)
The inner join condition can use equality (=) as well as other operators (such as <, >, <>)
Natural Join joins two tables on all columns having the same name
The resulting table contains all the attributes of both tables, but keeps only one copy of each common column
A condition join is like an equi-join, except the condition being tested doesn't have to be an equality (although it can be). It can be any well-formed predicate.
An equi-join is a join whose condition uses only equality comparisons
An equi-join can be an inner join, a left outer join, or a right outer join
A self JOIN is a regular join, but the table is joined with itself.
Left ⟕
Left Outer Join retrieves all rows from both tables that satisfy the join condition, along with the unmatched rows of the left table
Right ⟖
Right Outer Join retrieves all rows from both tables that satisfy the join condition, along with the unmatched rows of the right table
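A hedged sketch of inner and left outer join over lists of dicts; the relations emp and dept and their attributes are invented for illustration:

    def inner_join(r, s, on):
        # keep only pairs of rows that match on the join column
        return [{**a, **b} for a in r for b in s if a[on] == b[on]]

    def left_outer_join(r, s, on):
        result = []
        right_attrs = {k for b in s for k in b if k != on}
        for a in r:
            matches = [b for b in s if b[on] == a[on]]
            if matches:
                result.extend({**a, **b} for b in matches)
            else:
                # unmatched left row is kept; right attributes padded with None
                result.append({**a, **{k: None for k in right_attrs}})
        return result

    emp = [{"dept": 1, "name": "Ann"}, {"dept": 3, "name": "Bob"}]
    dept = [{"dept": 1, "dname": "CS"}]
    print(inner_join(emp, dept, "dept"))       # Ann matched with CS
    print(left_outer_join(emp, dept, "dept"))  # Bob kept, dname is None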
Division (r ÷ s)
Tuple t is in r ÷ s if, for every tuple ts in s, there is a tuple in r formed by combining t with ts (i.e., the tuple ⟨t, ts⟩ appears in r)
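A tiny sketch of division for r(A, B) ÷ s(B), with r as a set of (a, b) pairs and s as a set of b values; the enrolment example is invented:

    def divide(r, s):
        # a is in the result iff (a, b) is in r for every b in s
        return {a for a, _ in r if all((a, b) in r for b in s)}

    r = {("ann", "db"), ("ann", "os"), ("bob", "db")}  # (student, course) pairs
    s = {"db", "os"}                                   # required courses
    print(divide(r, s))  # {'ann'}: Ann takes every required course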