Iceberg
Optimization
Compaction
Why
Scan Less, Save Us
Compaction Strategies
BinPack
Simple Compaction
Small Files to Large
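The BinPack idea above (combine small files into larger ones) can be sketched as a grouping problem. This is a minimal illustration, assuming a hypothetical target size in MB and a first-fit-decreasing heuristic; the function name `bin_pack` and the sample sizes are made up for the example and are not Iceberg's own planner.

```python
from typing import List


def bin_pack(file_sizes_mb: List[int], target_mb: int) -> List[List[int]]:
    """Group file sizes so each group's total stays within target_mb.

    Uses first-fit decreasing. A single file larger than target_mb
    ends up alone in its own group.
    """
    bins: List[List[int]] = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)  # fits into an existing group
                break
        else:
            bins.append([size])  # no group had room, start a new one
    return bins


# Six small files, hypothetical 256 MB target per rewritten file.
groups = bin_pack([10, 200, 30, 120, 60, 90], target_mb=256)
print(groups)  # [[200, 30, 10], [120, 90], [60]]
```

Each inner list would be rewritten as one larger file, so six input files become three.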
Sort
Sort the records during compaction
Sort Order
Z-Order
Equal Weight Sorting
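One common way to give several columns equal weight in a sort is to interleave their bits into a single Z-value and sort by that. The sketch below assumes two non-negative integer columns; `z_value` and the sample rows are hypothetical, and real engines also have to handle strings, negatives, and other types.

```python
from typing import List, Tuple


def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y (x in even positions, y in odd)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


rows: List[Tuple[int, int]] = [(3, 0), (0, 3), (1, 1), (2, 2), (0, 0)]
z_sorted = sorted(rows, key=lambda r: z_value(r[0], r[1]))
print(z_sorted)  # [(0, 0), (1, 1), (3, 0), (0, 3), (2, 2)]
```

Unlike a plain sort on `(x, y)`, neither column dominates the ordering, so filters on either column can skip files.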
Sorting
Limit the number of scanned files
Partitioning
Partitioning History
Partition Columns
Query Issues
Hidden Partitioning
Metadata Tracked
Transforms
Time
Years - Months - Days - Hours
String
Truncate
Hash
Bucket
High Cardinality
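The time, truncate, and bucket transforms above can be sketched as plain functions that derive a partition value from a column value. This is an illustration only: the function names are hypothetical, the day value is shown as a readable string, and CRC32 stands in for the 32-bit Murmur3 hash that the Iceberg specification uses for bucketing.

```python
import zlib
from datetime import datetime


def day_transform(ts: datetime) -> str:
    """Time transform: reduce a timestamp to its day."""
    return ts.strftime("%Y-%m-%d")


def truncate_transform(value: str, width: int) -> str:
    """Truncate transform: keep the first `width` characters of a string."""
    return value[:width]


def bucket_transform(value: str, num_buckets: int) -> int:
    """Bucket transform: hash a high-cardinality value into a fixed number of buckets.

    Assumption: CRC32 is a stand-in hash chosen for this sketch.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


print(day_transform(datetime(2024, 3, 15, 13, 45)))  # 2024-03-15
print(truncate_transform("iceberg", 3))               # ice
print(bucket_transform("user-8675309", 16))           # some value in 0..15
```

Because the table metadata tracks the transform, a query can filter on the original column and still benefit from partition pruning.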
Partition Evolution
Different partitioning for the same table
Granularity
Applies only to new data when updated
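Partition evolution can be pictured as each data file remembering which partition spec it was written under. The sketch below assumes two hypothetical spec ids (0 for monthly, 1 for daily) and a made-up `write_file` helper; it only shows that old files keep their old partitioning after the spec changes.

```python
from datetime import datetime
from typing import Dict, List

# Hypothetical spec ids: the table started monthly and later evolved to daily.
PARTITION_SPECS: Dict[int, str] = {0: "month", 1: "day"}


def partition_value(ts: datetime, spec_id: int) -> str:
    """Compute the partition value for a timestamp under the given spec."""
    if PARTITION_SPECS[spec_id] == "month":
        return ts.strftime("%Y-%m")
    return ts.strftime("%Y-%m-%d")


data_files: List[Dict[str, object]] = []


def write_file(path: str, ts: datetime, spec_id: int) -> None:
    """Record a data file together with the spec it was written under."""
    data_files.append(
        {"path": path, "spec_id": spec_id, "partition": partition_value(ts, spec_id)}
    )


write_file("data/old.parquet", datetime(2024, 1, 10), spec_id=0)  # before evolution
write_file("data/new.parquet", datetime(2024, 2, 5), spec_id=1)   # after evolution

for f in data_files:
    print(f["path"], f["spec_id"], f["partition"])
```

The existing file is not rewritten; only data written after the change uses the finer granularity.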
MOR vs COW
Others
Metrics Collection
Partition - Sort - Compact
CH1: History
OLAP
Architecture
Storage
File Format
Row Oriented
Column Oriented
Table Format
Metadata Layer
How data files should be laid out in storage
Catalog
Table Information
Where the table's data is stored
Storage Engine
Handles the physical storage of data
Compute Engine
Query & Process
Data Warehouse
OLAP Coupled System
Disadvantages
Data is Locked
No Open Format
Only one compute engine
Only structured data
Data Lake
Architecture
HDFS - MapReduce - Hive
Many Compute Engines
Challenges
No Table Format
Performance
Lack of ACID
Data Lakehouse
Data Lake + Table Format
Pros
Fewer Copies
Faster Queries
Mistakes Don't Hurt
Low Cost
Open Architecture
Table Formats
Hive
Directories & Sub-Directories
Modern
Object Storage
Table is a canonical list of files
CH2: Iceberg Architecture
Data Layer
Data Files
Parquet
Delete Files
Data lake files are immutable
Track Updates and Deletes
MOR
COW
Positional
Equality
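Since data files are immutable, merge-on-read keeps deletes in separate delete files and applies them when the table is read. A minimal sketch of that read-time filtering, assuming a hypothetical in-memory representation: positional deletes as `(file_path, row_position)` pairs and equality deletes as column-value dictionaries.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, object]


def apply_deletes(
    file_path: str,
    rows: List[Row],
    positional_deletes: Set[Tuple[str, int]],
    equality_deletes: List[Row],
) -> List[Row]:
    """Return the rows of one data file that survive both kinds of deletes."""
    kept: List[Row] = []
    for position, row in enumerate(rows):
        # Positional delete: identifies a row by file path and position.
        if (file_path, position) in positional_deletes:
            continue
        # Equality delete: identifies rows by matching column values.
        if any(all(row.get(k) == v for k, v in d.items()) for d in equality_deletes):
            continue
        kept.append(row)
    return kept


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
live = apply_deletes(
    "data/f1.parquet",
    rows,
    positional_deletes={("data/f1.parquet", 0)},
    equality_deletes=[{"id": 3}],
)
print(live)  # [{'id': 2, 'name': 'b'}]
```

Copy-on-write avoids this read-time work by rewriting the affected data file at write time instead.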
Puffin Files
Metadata Layer
Metadata Files
Track Manifest Lists
Table Schema
Partition Information
Current Snapshot
Manifest Lists
Snapshot
Contains Manifest Files
Statistics
Manifest Files
Track data layer files
Partition Membership
Record Count
Statistics
Catalog
Point of communication with the systems
Pointer to the latest metadata file
CH3: Operations
Read
Check Catalog
Current Metadata File
Metadata File
Determine Schema to prepare internal memory
Understand Partitioning Schema
Current Snapshot ID
Locate Manifest List
Manifest List
Manifest File Path
Partition Details
Manifest Files
Scan Data Files
Compare partition values to select the relevant data files
Statistical Information
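The read path above (catalog, metadata file, manifest list, manifest files, data files) can be sketched with plain dictionaries. Everything here is a hypothetical in-memory stand-in: the table name, file paths, and field names are invented for illustration and do not mirror the real file layouts.

```python
from typing import List

# Catalog: table name -> current metadata file.
catalog = {"db.orders": "metadata/v2.json"}

# Metadata file: schema, partitioning, current snapshot, snapshot -> manifest list.
metadata_files = {
    "metadata/v2.json": {
        "schema": ["order_id", "order_day", "amount"],
        "partition_spec": ["order_day"],
        "current_snapshot_id": 2,
        "snapshots": {1: "snap-1.avro", 2: "snap-2.avro"},
    }
}

# Manifest list: manifest file paths plus partition ranges for pruning.
manifest_lists = {
    "snap-2.avro": [
        {"manifest_path": "m1.avro", "partition_range": ("2024-01-01", "2024-01-31")},
        {"manifest_path": "m2.avro", "partition_range": ("2024-02-01", "2024-02-29")},
    ]
}

# Manifest files: data file paths, partition values, and column statistics.
manifest_files = {
    "m1.avro": [
        {"file_path": "data/jan-a.parquet", "partition": "2024-01-15", "amount_max": 40},
        {"file_path": "data/jan-b.parquet", "partition": "2024-01-15", "amount_max": 900},
        {"file_path": "data/jan-c.parquet", "partition": "2024-01-20", "amount_max": 100},
    ],
    "m2.avro": [
        {"file_path": "data/feb-a.parquet", "partition": "2024-02-01", "amount_max": 500},
    ],
}


def plan_scan(table: str, day: str, min_amount: int) -> List[str]:
    """Return the data files needed for: order_day = day AND amount >= min_amount."""
    metadata = metadata_files[catalog[table]]
    manifest_list = manifest_lists[metadata["snapshots"][metadata["current_snapshot_id"]]]
    selected: List[str] = []
    for entry in manifest_list:
        low, high = entry["partition_range"]
        if not (low <= day <= high):  # prune whole manifests by partition range
            continue
        for data_file in manifest_files[entry["manifest_path"]]:
            if data_file["partition"] != day:  # prune by partition value
                continue
            if data_file["amount_max"] < min_amount:  # prune by statistics
                continue
            selected.append(data_file["file_path"])
    return selected


files_to_scan = plan_scan("db.orders", "2024-01-15", min_amount=100)
print(files_to_scan)  # ['data/jan-b.parquet']
```

Of the four data files, only one is opened; the rest are eliminated using metadata alone.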
Write
Steps
Engine parses the query
Create data
Create metadata
Update catalog
Create
Write Metadata File
Table Schema
Assign UUID
Snapshot Created
Update Catalog
Metadata Pointer
Insert
Check the Catalog
Determine Current Metadata Location
Understand Schema
Understand Partitioning
Write Data Files
Write Parquet files based on partitioning
Write Metadata Files
Manifest File
Avro Format
Location of Data File
Statistics
Manifest List
Manifest File Path
No. of Data Files
Statistics
Metadata File
New Snapshot
Snapshot ID
Manifest List Path
Operation Summary
Update Catalog
Ensure no other snapshot was committed in the meantime
Update the pointer to the new metadata file
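The final commit step can be sketched as an atomic compare-and-swap on the catalog pointer: the swap succeeds only if the pointer still refers to the metadata file the writer started from. `InMemoryCatalog` and the file names below are hypothetical; a real catalog provides this atomicity through its own backing store.

```python
import threading
from typing import Dict, Optional


class InMemoryCatalog:
    """Toy catalog mapping a table name to its current metadata file."""

    def __init__(self) -> None:
        self._pointers: Dict[str, str] = {}
        self._lock = threading.Lock()

    def current(self, table: str) -> Optional[str]:
        with self._lock:
            return self._pointers.get(table)

    def compare_and_swap(self, table: str, expected: Optional[str], new: str) -> bool:
        """Point the table at `new` only if it still points at `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # another writer committed first
            self._pointers[table] = new
            return True


catalog = InMemoryCatalog()
catalog.compare_and_swap("db.orders", None, "metadata/v1.json")  # table creation

# Two writers start from the same metadata file.
base_a = catalog.current("db.orders")
base_b = catalog.current("db.orders")

a_ok = catalog.compare_and_swap("db.orders", base_a, "metadata/v2-a.json")
b_ok = catalog.compare_and_swap("db.orders", base_b, "metadata/v2-b.json")

# Writer B re-reads the catalog, rebuilds its metadata, and retries.
base_b = catalog.current("db.orders")
b_retry_ok = catalog.compare_and_swap("db.orders", base_b, "metadata/v3-b.json")

print(a_ok, b_ok, b_retry_ok)  # True False True
```

The losing writer keeps its already written data files and only redoes the metadata, which is why the conflict check happens at the very end.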