Iceberg - Coggle Diagram

- - - - Scan Less, Save Us
    - - BinPack
        
        Simple Compaction
        
        Small Files to Large
      - Sort
        
        Sort the records during compaction
        
        Sort Order
      - Z-Order
        
        Equal Weight Sorting
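The BinPack and Z-Order ideas above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not how any engine implements compaction: `bin_pack` and `z_value` are hypothetical helper names, files are represented only by their sizes, and the Z-value is built from two non-negative integer columns by interleaving their bits so that neither column dominates the sort.

```python
def bin_pack(file_sizes, target_size):
    """Greedily group small files into rewrite groups of at most target_size.

    A file larger than target_size ends up in a group of its own.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current group if adding this file would overflow it.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a single sort key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return z


# Hypothetical usage: group file sizes (in MB) and order rows by Z-value.
groups = bin_pack([10, 20, 30, 40], target_size=50)
rows = [(1, 1), (0, 1), (1, 0), (0, 0)]
z_sorted = sorted(rows, key=lambda r: z_value(r[0], r[1]))
```

Sorting by the interleaved key keeps rows that are close in both columns near each other in the output files, which is what lets file-level statistics prune on either column.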
  - - - Partition Columns
      - Query Issues
    - - Metadata Tracked
    - - Time
        
        Years - Months - Days - Hours
      - String
        
        Truncate
      - Hash
        
        Bucket
        
        High Cardinality
    - - Different partitioning for the same table
      - Granularity
      - An updated partition spec applies only to new data
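The time, truncate and bucket transforms listed above could be sketched as plain functions that map a column value to a partition value. This is a rough illustration, not Iceberg's implementation: the function names are hypothetical, and the bucket transform uses CRC32 purely as a stand-in hash, so its bucket numbers will not match what Iceberg produces.

```python
import zlib
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def year_transform(ts):
    """Years since 1970."""
    return ts.year - 1970


def month_transform(ts):
    """Months since 1970-01."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


def day_transform(ts):
    """Days since 1970-01-01."""
    return (ts - EPOCH).days


def hour_transform(ts):
    """Hours since 1970-01-01 00:00 UTC."""
    return int((ts - EPOCH).total_seconds() // 3600)


def truncate_transform(value, width):
    """Keep only the first `width` characters of a string."""
    return value[:width]


def bucket_transform(value, num_buckets):
    """Spread a high-cardinality string column over a fixed number of buckets.

    Assumption: CRC32 is a stand-in hash chosen for this sketch only.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


ts = datetime(1971, 3, 15, 5, 30, tzinfo=timezone.utc)
partition_values = {
    "year": year_transform(ts),
    "month": month_transform(ts),
    "prefix": truncate_transform("iceberg", 3),
    "bucket": bucket_transform("user-12345", 16),
}
```

Because the partition value is derived from the column, a query that filters on the original column can be mapped to partition values without the user naming a separate partition column.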
- - - - Storage
      - File Format
        
        Row Oriented
        
        Column Oriented
      - Table Format
        
        Metadata Layer
        
        How data files should be laid out in storage
      - Catalog
        
        Table Information
        
        Where the table's data is stored
      - Storage Engine
        
        Performs the physical storage
      - Compute Engine
        
        Query & Process
  - - - Data is Locked
      - No Open Format
      - Only one compute engine
      - Only structured data
  - - - HDFS - MapReduce - Hive
      - Many Compute Engines
    - - No Table Format
      - Performance
      - Lack of ACID
  - - - Fewer Copies
      - Faster Queries
      - Mistakes Don't Hurt
      - Low Cost
      - Open Architecture
    - - Hive
        
        Directories & Sub-Directories
      - Modern
        
        Object Storage
        
        Table is a canonical list of files
- - - - Parquet
    - - Data Lake is immutable
      - Track Updates and Deletes
      - MOR (Merge-on-Read)
      - COW (Copy-on-Write)
      - Positional
      - Equality
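Since data files are immutable, merge-on-read keeps deletes in separate delete files and applies them when the data is read. A minimal sketch of the two delete styles named above, assuming rows are plain dictionaries and using hypothetical helper names: a positional delete identifies rows by their position in one data file, while an equality delete identifies rows by column values.

```python
def apply_positional_deletes(rows, deleted_positions):
    """Drop rows whose position in the data file is listed as deleted."""
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]


def apply_equality_deletes(rows, delete_predicates):
    """Drop rows that match every column value of any delete predicate."""
    def is_deleted(row):
        return any(
            all(row.get(col) == val for col, val in pred.items())
            for pred in delete_predicates
        )
    return [row for row in rows if not is_deleted(row)]


# Hypothetical contents of one data file.
data_file_rows = [
    {"id": 1, "country": "DE"},
    {"id": 2, "country": "FR"},
    {"id": 3, "country": "DE"},
]

after_positional = apply_positional_deletes(data_file_rows, {1})
after_equality = apply_equality_deletes(data_file_rows, [{"country": "DE"}])
```

Copy-on-write takes the other route: the affected data file is rewritten without the deleted rows, so reads need no merge step but writes cost more.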
  - - - Track Manifest Lists
      - Table Schema
      - Partition Information
      - Current Snapshot
    - - Snapshot
      - Contains Manifest Files
      - Statistics
    - - Track data layer files
      - Partition Membership
      - Records Count
      - Statistics
- - - - Current Metadata File
    - - Determine the schema to prepare in-memory structures
      - Understand Partitioning Schema
      - Current Snapshot ID
      - Locate Manifest List
    - - Manifest File Path
      - Partition Details
    - - Scan Data Files
      - Compare partition values to select the relevant data files
      - Statistical Information
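The read path above (catalog, metadata file, manifest list, manifest files, data files) can be sketched with in-memory dictionaries standing in for the real files. Everything here is an assumption made for illustration: the table name, paths, field names and values are hypothetical, and real manifests carry far more detail than this.

```python
# Catalog: table name -> location of the current metadata file.
CATALOG = {"db.events": "metadata/v2.metadata.json"}

# Metadata file: current snapshot ID and each snapshot's manifest list.
METADATA_FILES = {
    "metadata/v2.metadata.json": {
        "current_snapshot_id": 2,
        "snapshots": {1: "metadata/snap-1.avro", 2: "metadata/snap-2.avro"},
    }
}

# Manifest list: one entry per manifest file, with partition range details.
MANIFEST_LISTS = {
    "metadata/snap-2.avro": [
        {"manifest_path": "metadata/m1.avro", "day_min": 100, "day_max": 101},
        {"manifest_path": "metadata/m2.avro", "day_min": 102, "day_max": 103},
    ]
}

# Manifest file: one entry per data file, with partition value and statistics.
MANIFESTS = {
    "metadata/m1.avro": [
        {"file_path": "data/a.parquet", "day": 100, "id_min": 1, "id_max": 50},
        {"file_path": "data/b.parquet", "day": 101, "id_min": 51, "id_max": 90},
        {"file_path": "data/c.parquet", "day": 101, "id_min": 91, "id_max": 120},
    ],
    "metadata/m2.avro": [
        {"file_path": "data/d.parquet", "day": 102, "id_min": 121, "id_max": 200},
    ],
}


def plan_scan(table, day, record_id):
    """Return the data files that may contain rows with this day and id."""
    metadata = METADATA_FILES[CATALOG[table]]
    snapshot_id = metadata["current_snapshot_id"]
    manifest_list = MANIFEST_LISTS[metadata["snapshots"][snapshot_id]]

    files = []
    for entry in manifest_list:
        # Skip whole manifests whose partition range cannot match.
        if not entry["day_min"] <= day <= entry["day_max"]:
            continue
        for data_file in MANIFESTS[entry["manifest_path"]]:
            # Skip files in other partitions, then use min/max statistics.
            if data_file["day"] != day:
                continue
            if not data_file["id_min"] <= record_id <= data_file["id_max"]:
                continue
            files.append(data_file["file_path"])
    return files


files_to_read = plan_scan("db.events", day=101, record_id=100)
```

Only the surviving files are opened, so most pruning happens on metadata before any data file is touched.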
  - - - Engine Parse
      - Create Metadata
      - Create Data
      - Update Catalog
    - - Write Metadata File
        
        Table Schema
        
        Assign UUID
        
        Snapshot Created
      - Update Catalog
        
        Metadata Pointer
    - - Check the Catalog
        
        Determine Current Metadata Location
        
        Understand Schema
        
        Understand Partitioning
      - Write Data Files
        
        Write Parquet files based on partitioning
      - Write Metadata Files
        
        Manifest File
        
        AVRO Format
        
        Location of Data File
        
        Statistics
        
        Manifest List
        
        Manifest File Path
        
        No. of Data Files
        
        Statistics
        
        Metadata File
        
        New Snapshot
        
        Snapshot ID
        
        Manifest List Path
        
        Operation Summary
      - Update Catalog
        
        Ensure no other snapshot was committed in the meantime
        
        Update the pointer to the new metadata file
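The final catalog update above is an optimistic commit: the writer prepares a new metadata file, then swaps the catalog pointer only if nobody else has moved it since the writer started. A minimal sketch, assuming an in-memory catalog; `InMemoryCatalog`, `compare_and_swap` and `commit` are hypothetical names for this illustration, and real catalogs provide the atomic swap through their own mechanisms.

```python
import threading


class InMemoryCatalog:
    """Toy catalog: maps a table name to its current metadata file location."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Move the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # another writer committed first
            self._pointers[table] = new
            return True


def commit(catalog, table, write_metadata, max_retries=3):
    """Write metadata on top of the current pointer and try to swap it in.

    `write_metadata(base)` is a caller-supplied function that returns the
    location of a new metadata file built on top of `base`.
    """
    for _ in range(max_retries):
        base = catalog.current(table)
        new_location = write_metadata(base)
        if catalog.compare_and_swap(table, base, new_location):
            return new_location
        # The pointer moved: re-read it and rebuild on top of the new base.
    raise RuntimeError("commit failed after retries")


catalog = InMemoryCatalog()
first = commit(catalog, "db.events", lambda base: "metadata/v1.metadata.json")
# A writer still holding the stale base (None) is rejected.
stale_ok = catalog.compare_and_swap("db.events", None, "metadata/other.json")
```

A rejected writer loses no data files; it only has to rebuild its metadata on top of the newer snapshot and try again.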