Data Engineering - Coggle Diagram

- - - - Use indexes
      - New columns for join
  - - - Inmon
        All related data lives inside the same database. The data can then be split into data marts. There are multiple sources, but the data ends up in the same place, highly normalized.
        This is usually found in OLAP systems
      - Kimball
        Related to data marts. This could lead to redundancy.
        
        Fact table: insert only. It only records events that have happened
        
        Dimension tables: they provide the reference data, attributes and context for the events in the fact table
        
        What happens when we want to update a dimension?
        Slowly changing dimensions
        
        Type 1: overwrite existing records
        
        Type 2: add a new row for the changed record and use a column to flag which version is current
        
        Type 3: similar to Type 2, but adds a new column (holding the previous value) instead of a new row
      - Data vault
        Inserts data from a source system directly into insert-only tables
        
        HUB tables
        Store only ID fields
        
        LINK tables
        Used to create and maintain relationships among HUBs
        
        SATELLITE tables
        Represent the attributes associated with the IDs in HUB tables
      - Wide denormalized tables
        All data lives inside the same table. This is viable because storage is now very cheap
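The three slowly changing dimension types listed under Kimball can be sketched in a few lines. This is a minimal illustration on in-memory rows, not a warehouse implementation; the column names (`customer_id`, `city`, `is_current`, `valid_from`, `valid_to`, `previous_city`) are assumptions made up for the example.

```python
from datetime import date

def scd_type1(rows, key, new_attrs):
    """Type 1: overwrite the existing record in place (no history kept)."""
    for row in rows:
        if row["customer_id"] == key:
            row.update(new_attrs)
    return rows

def scd_type2(rows, key, new_attrs, changed_on):
    """Type 2: flag the current row as closed and append a new current row."""
    for row in rows:
        if row["customer_id"] == key and row["is_current"]:
            row["is_current"] = False
            row["valid_to"] = changed_on
            new_row = {**row, **new_attrs, "is_current": True,
                       "valid_from": changed_on, "valid_to": None}
            rows.append(new_row)
            break  # stop iterating right after appending
    return rows

def scd_type3(rows, key, column, new_value):
    """Type 3: keep the previous value in an extra column on the same row."""
    for row in rows:
        if row["customer_id"] == key:
            row[f"previous_{column}"] = row[column]
            row[column] = new_value
    return rows

t1 = scd_type1([{"customer_id": 1, "city": "Rome"}], 1, {"city": "Milan"})
t2 = scd_type2(
    [{"customer_id": 1, "city": "Rome", "is_current": True,
      "valid_from": date(2020, 1, 1), "valid_to": None}],
    1, {"city": "Milan"}, date(2024, 6, 1),
)
t3 = scd_type3([{"customer_id": 1, "city": "Rome"}], 1, "city", "Milan")
```

Type 1 loses the old city, Type 2 keeps both versions as separate rows, and Type 3 keeps only one previous value next to the current one.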
    - - Views
      - Materialized views
      - Composable materialized views
      - Federated queries
      - Data virtualization
  - - - Hybrid Approach
        We retain the nested data inside the table, but we create columns for the most frequently accessed values
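A rough sketch of the hybrid approach: the raw nested payload is stored untouched, and a few flat columns are extracted next to it. The field names and the `HOT_FIELDS` mapping are hypothetical, chosen only to illustrate the idea.

```python
import json

# Hypothetical mapping: flat column name -> path inside the nested payload.
HOT_FIELDS = {
    "user_id": ("user", "id"),
    "country": ("user", "address", "country"),
}

def get_path(doc, path):
    """Walk a nested dict along path; return None if any key is missing."""
    for key in path:
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc

def to_hybrid_row(raw_json):
    """Keep the raw nested payload and add one flat column per hot field."""
    doc = json.loads(raw_json)
    row = {"payload": raw_json}
    for column, path in HOT_FIELDS.items():
        row[column] = get_path(doc, path)
    return row

event = '{"user": {"id": 42, "address": {"country": "IT"}}, "items": [1, 2]}'
row = to_hybrid_row(event)
```

Queries that only need `user_id` or `country` can read the flat columns, while less common fields such as `items` stay available in `payload`.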
- - - - ACID
        
        Atomicity: several changes are applied as a single unit, so either all of them succeed or none do
        
        Consistency: any query to the database will return the most recent data
        
        Isolation: if two changes to the same resource arrive at the same time, the system applies the first and then the second
        
        Durability: committed data will never be lost, even in case of power failure
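Atomicity can be illustrated with Python's built-in `sqlite3` module. This is a small sketch under assumptions: the `accounts` table, the `CHECK` constraint and the `transfer` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one unit: both happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # both updates are committed
failed = transfer(conn, "alice", "bob", 500)  # debit violates CHECK, credit is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the second call the credit to `bob` runs first, but it is rolled back together with the failed debit, so the balances only reflect the first transfer.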
    - - Key-Value stores
        Each record is identified by a unique key. Useful when we need ultrafast lookups
      - Document stores
        Collection of documents retrieved by a key
      - Wide column
        Used to store massive quantities of data with low latency
      - Graph Database
        Stores data with a mathematical graph structure
      - Search
        Designed to search semantic and structural characteristics of data
      - Time-series
        Values organized by time
    - - REST
        Dominant API paradigm, stands for Representational State Transfer
      - GraphQL
        Alternative to REST APIs
      - Webhooks
        Also called Reverse API. When something specific happens in the source system, it triggers an HTTP request to a predefined URL in the data consumer's system.
        Example:
        The source system is a website. When a user clicks a button, the website sends an HTTP POST request to a monitoring system (owned by the website owner). The monitoring system receives this request and processes it, such as incrementing a counter by +1
      - RPC and gRPC
        Remote Procedure Call, used in distributed computing
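The webhook example (a button click triggers an HTTP POST, and the monitoring system increments a counter) can be sketched with the Python standard library alone. The URL path `/hooks/clicks` and the `button_clicked` event name are hypothetical, and both sides run in one process here only to keep the sketch self-contained.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

counter = {"clicks": 0}  # state kept by the monitoring system
counter_lock = threading.Lock()

class WebhookHandler(BaseHTTPRequestHandler):
    """Monitoring system: receives the POST and increments the counter."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        if event.get("event") == "button_clicked":
            with counter_lock:
                counter["clicks"] += 1
        self.send_response(204)  # acknowledge with no response body
        self.end_headers()

    def log_message(self, *args):
        pass  # silence the default request logging

# Opener that ignores proxy settings, since the target is localhost.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

def send_click(url):
    """Source system: called when the user clicks the button."""
    body = json.dumps({"event": "button_clicked"}).encode()
    req = urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"})
    with opener.open(req) as resp:
        return resp.status

server = HTTPServer(("127.0.0.1", 0), WebhookHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/hooks/clicks"
statuses = [send_click(url) for _ in range(3)]
server.shutdown()
server.server_close()
```

The data consumer never polls the website: it only exposes a URL and reacts when the source system calls it, which is why webhooks are described as a reverse API.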
- - - - Event driven architecture
  - - - Data marts
        Smaller data warehouse focused on a specific line of business
    - - Data Lakehouse
        Data lake + controls and functionalities of a data warehouse
- - - - Eventual consistency
        
        Basically available: database reads and writes are made on a best-effort basis, which means that data is available most of the time
        
        Soft state: it is uncertain whether the transaction has been committed or not
        
        Eventual consistency: at some point, reading data will return consistent values
      - Strong consistency
        The system first distributes the changes to every node, and only then serves reads of the data
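The difference between the two consistency models can be shown with a toy in-memory store. This is only a sketch: `ReplicatedStore` and its methods are invented for illustration, and real systems propagate writes over a network rather than in a loop.

```python
class ReplicatedStore:
    """Toy key-value store with one primary node and several replicas."""

    def __init__(self, n_replicas=2):
        self.primary = {}
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # writes not yet propagated to the replicas

    def write_eventual(self, key, value):
        # Acknowledge after writing to the primary only; replicas lag behind.
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Stand-in for background propagation of pending writes.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()

    def write_strong(self, key, value):
        # Flush older pending writes, then update every node before returning.
        self.replicate()
        self.primary[key] = value
        for replica in self.replicas:
            replica[key] = value

    def read(self, replica_index, key):
        return self.replicas[replica_index].get(key)

store = ReplicatedStore()
store.write_eventual("x", 1)
stale = store.read(0, "x")    # replica has not seen the write yet
store.replicate()
fresh = store.read(0, "x")    # after propagation the value is visible
store.write_strong("x", 2)
strong = store.read(1, "x")   # visible on every replica immediately
```

With eventual consistency a read can return an old value (here `None`) until propagation completes; with strong consistency the write is slower, but every later read sees it.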