Please enable JavaScript.
Coggle requires JavaScript to display documents.
Eleventh reading - Parallel Databases, Ariana Alvarado Molina - 2021089068…
Eleventh reading - Parallel Databases
Evolution of Parallel Databases
Despite skepticism two decades ago, parallel database systems are now widely marketed by various vendors.
Increased Transaction Demands
Growing computer usage and the rise of the World Wide Web have led to heightened transaction requirements for organizations.
Utilizing Massive Data
Organizations leverage vast datasets, such as purchase records and web interactions, for planning and pricing.
Parallelization Advantage
The set-oriented nature of database queries naturally allows for parallelization, showcasing the effectiveness and scalability in both commercial and research systems.
I/O Parallelism
Reducing disk retrieval time by splitting relations across multiple disks characterizes I/O parallelism.
The primary goal is to enhance the efficiency of fetching relations from disk.
The prevalent approach in parallel databases involves horizontal partitioning.
Partitioning Techniques
Round-robin
Scans relation in any order.
Sends each tuple to disk Di mod n for even distribution.
Hash Partitioning
Designates attributes as partitioning attributes.
Uses a hash function to place tuples on disks based on the result.
Range Partitioning
Distributes tuples based on attribute-value ranges.
Assigns contiguous ranges to each disk for efficient organization.
Comparison of Partitioning Techniques
Enables parallel retrieval and writing of relations, enhancing transfer rates.
I/O parallelism significantly accelerates reading or writing entire relations.
Data Access Types
Full Relation Scan
Scanning the entire relation.
Point Queries
Locating a tuple associatively using specific attribute values.
Range Queries
Locating tuples with attribute values within a specified range.
Different partitioning techniques :
Round-robin
Suited for sequential relation scans.
Complex for point and range queries as all disks must be used.
Hash Partitioning
Ideal for point queries on partitioning attributes.
Efficient for sequential relation scans, distributing workload evenly.
Range Partitioning
Well-suited for point and range queries on partitioning attribute.
Handling of Skew
Skew Types
Attribute-value skew
Occurs when some values frequently appear in partitioning attributes, leading to skewed partitioning.
Partition skew
Involves load imbalance in partitioning, even without attribute skew.
Impact of Skew
Skew, even minor, can significantly reduce performance.
Skew becomes more problematic with higher parallelism levels.
Challenges with Skew
Loss of speedup in parallel access increases with skew.
Constructing balanced range-partition vectors may incur extra I/O overhead.
Intraquery Parallelism
Intraquery Parallelism Overview
Refers to parallel execution of a single query on multiple processors and disks.
Crucial for accelerating long-running queries.
Two Forms of Parallelization
Intraoperation Parallelism:
Speeds up query processing by parallelizing individual operations (e.g., sort, select, project, join).
Interoperation Parallelism
Accelerates query processing by executing operations that don't depend on each other in parallel, possibly using pipeline mechanisms.
Interquery Parallelism
Different queries or transactions execute simultaneously, increasing transaction throughput.
Simplest in shared-memory parallel systems.
Coordinating tasks like locking and logging is more complex.
Protocols, integrated with concurrency control, guarantee cache coherency.
Intraoperation Parallelism
Relational operations operate on large sets of tuples.
Parallelizing operations involves executing them simultaneously on different subsets of relations.
Intraoperation parallelism is inherent in database systems due to the potentially large number of tuples
Homogeneou Distributed Databases
All sites have identical database-management system software.
Sites cooperate, surrendering some autonomy for seamless transaction processing.
Heterogeneous Distributed Database
Sites may use different schemas and database-management system software.
Limited cooperation in transaction processing.
Distributed Data Storage
Options for storing a relation in a distributed database: Replication and Fragmentation.
Data Replication
Advantages
Availability
Ensures data availability even if one site fails.
Increased Parallelism
Supports parallel processing of queries across multiple sites.
Reduced Data Movement
Minimizes data movement between sites.
Data Fragmentation
Relation Fragmentation
Division of relation r into fragments r1, r2,...,rn.
Fragments contain information for reconstructing the original relation r.
Transparency
Fragmentation Transparency
Users unaware of how a relation has been fragmented.
Replication Transparency
Users perceive each data object as logically unique.
Location Transparency
Users not required to know the physical location of data.
Distributed Transactions
Transaction Types
Local Transactions
Access/update data in one local database.
Global Transactions:
Global Transactions
Access/update data in multiple local databases.
ACID Properties
Preserved for both local and global transactions.
Challenges for Global Transactions
Complex due to multiple participating sites.
Failure at one site or communication link can lead to errors.
Sections Covered
System Structure and Failure Modes
Protocols for Atomic Commit
Concurrency Control Protocols
Maintaining Functionality Amidst Failures
System Structure
Components
Local Transaction Manager
Ensures ACID properties for local transactions.
Transaction Coordinator
Coordinates execution of all transactions at the site.
Transaction Manager Functions
Manages local transactions.
Maintains recovery log.
Participates in concurrency control.
Transaction Coordinator Functions
Starts execution of transactions.
Distributes subtransactions for execution at appropriate sites.
Failure Modes in Distributed Systems
Common Failure Types
Failure of a site.
Loss of messages.
Network partition.
Additional Challenges in Distributed Systems
Loss or corruption of messages is a constant risk.
Routing messages between non-directly connected sites involves potential rerouting.
Commit Protocols
All sites executing a transaction must agree on its final outcome.
Two-Phase Commit (2PC) is a widely used protocol for this purpose.
Phase 1 - Prepare
Coordinator adds <prepare T> to the log and sends a prepare T message to participating sites.
Sites respond with either <no T> (abort) or <ready T> (ready to commit).
Phase 2 - Commit or Abort
Coordinator determines based on responses:
If all sites are ready, it adds <commit T> to the log.
If any site is not ready, it adds <abort T> to the log.
Abort Before Ready
A site can unconditionally abort T before sending ready T to the coordinator.
Once ready T is sent, the transaction is in the ready state.
Failure Handling in 2PC Protocol
Failure of a Participating Site
Failure of a Participating Site
Coordinator assumes an aborted response if the site fails before responding with ready T.
If the failure occurs after the coordinator receives ready T, the protocol proceeds normally, ignoring the site failure.
Upon recovery, the participating site
Executes redo(T) if <commit T> is in the log.
Executes undo(T) if <abort T> is in the log.
Executes redo(T) or undo(T) based on the coordinator's decision.
Failure of the Coordinator
Participating sites decide the fate of T:
Commit if <commit T> is in the log.
Abort if <abort T> is in the log.
If no <ready T> record is found, assume T is aborted.
Network Partition
No effect if coordinator and participants remain in one partition.
If the coordinator and participants are in different partitions:
Sites in one partition execute the protocol assuming sites in other partitions have failed.
Sites in other partitions execute failure recovery protocol.
Disadvantage of 2PC
Coordinator failure may lead to blocking, delaying the decision (commit or abort) until the coordinator recovers.
Recovery and Concurrency Control
Recovery Procedure
Use a recovery algorithm (e.g., Section 16.4) when a failed site restarts.
Treat in-doubt transactions specially (transactions with <ready T> but without <commit T> or <abort T>).
Challenges in Recovery
Normal transaction processing is delayed until in-doubt transactions are resolved.
Finding the status of in-doubt transactions can be slow, involving communication with multiple sites.
Solution Using Lock Information in Log
Instead of <ready T> log record, use <ready T, L> where L is a list of all write locks held by transaction T.
At recovery time, for each in-doubt transaction T, reacquire all write locks noted in the <ready T, L> log record.
Benefits of Lock Information
Enables faster site recovery.
Allows transaction processing to start before the commit–abort status of in-doubt transactions is determined.
Three-Phase Commit
Extension of Two-Phase Commit (2PC)
Aims to avoid the blocking problem under specific assumptions.
Assumes no network partition and not more than k sites fail (predefined k).
Avoids Blocking
Introduces an extra third phase involving multiple sites in the decision to commit.
Coordinator ensures that at least k other sites know its intent to commit before noting the decision in persistent storage.
Handling Coordinator Failure
If the coordinator fails, remaining sites select a new coordinator.
New coordinator restarts the third phase if needed, or aborts the transaction if no informed site is available.
Alternative Models of Transaction Processing
Scenario
Consider funds transfer between different banks with individual computers.
Traditional two-phase commit may lead to blocking issues affecting all transactions at each bank.
Alternative Approach
Use persistent messaging for funds transfer.
Similar to the concept of a physical check being transferred between banks.
Message is guaranteed not to be delivered if the transaction aborts.
Implementation
Database recovery techniques are employed to implement persistent messaging over network channels.
Provides reliable and fast communication between distributed systems.
Error Handling Complexity
More complex than two-phase commit.
Requires error-handling code at both sending and receiving sites.
Exception conditions depend on the application.
Benefits and Applications:
Eliminates blocking issues, especially beneficial for transactions crossing organizational boundaries.
Widely used for transactions originating outside an organization to prevent local data access blocking.
Ariana Alvarado Molina - 2021089068