Designing Data-Intensive Applications by Martin Kleppmann - Coggle Diagram
Designing Data-Intensive Applications
by Martin Kleppmann
Textbook Chapters
Chapter 1: Reliable, Scalable, and Maintainable Applications
Classify my projects
Project1: Fancy VR Mall (Importing to China)
Data-intensive: it doesn't need super-fast CPU/ML computation for most features, but the huge and complex VR assets and user-usage data are the bottleneck.
Project2: Trading
Data-intensive: because users will contribute data here
Compute-intensive: because computing speed and GPU/CPU cycles are crucial for ML decision-making and online data mining
Project3: E-commerce + Intl. Business (Offshore)
Data-intensive: mostly needs to retrieve new data and information from the world; fast response is not necessary.
Strategies to increase the fault-tolerance (resilience) of systems
Increase the rate of faults by triggering them deliberately
E.g. Netflix's Chaos Monkey
Chaos Monkey randomly terminates virtual machine instances and containers that run inside of your production environment. Exposing engineers to failures more frequently incentivizes them to build resilient services.
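The idea can be sketched in a few lines of Python (a toy sketch: the real Chaos Monkey talks to cloud-provider APIs, while the instance IDs and the `terminate` callback here are illustrative assumptions):

```python
import random

def chaos_monkey(instances, terminate, probability=0.1):
    """Randomly terminate a fraction of running instances.

    `instances` is a list of instance IDs; `terminate` is a callback
    that actually kills an instance (e.g. via a cloud provider API).
    Returns the IDs that were terminated.
    """
    killed = []
    for instance_id in instances:
        if random.random() < probability:
            terminate(instance_id)
            killed.append(instance_id)
    return killed

# Hypothetical usage: "terminate" by printing, so nothing real is harmed.
victims = chaos_monkey(["i-001", "i-002", "i-003"],
                       terminate=print, probability=1.0)
```

Running this regularly in production is what forces services to survive instance loss.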
Some faults simply cannot be cured, e.g. data leakage; the only solution is to prevent future leakage.
Reliability
Types of Faults
Hardware Faults
RAM becomes faulty
The power grid has a blackout
Human errors
Someone unplugs the wrong network cable
Configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10-25% of outages.
:red_flag: Preventions
Provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
Well-designed abstractions, APIs, and interfaces
Test thoroughly at all levels
Unit tests
Whole-system integration tests
Manual tests
Automated testing
Minimize the impact in the case of a failure
Make it fast to roll back configuration changes (backups)
Roll out new code gradually (small amount at a time)
Provide tools to recompute data, in case the old computation turns out to be incorrect (a panic button: don't make users think about how to fix it, just give them the answer)
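The "roll out gradually" idea above is often implemented by deterministically bucketing users; a minimal Python sketch (the hashing scheme and function name are illustrative, not any specific tool's API):

```python
import hashlib

def in_rollout(user_id, percent):
    """Deterministically put a fixed `percent` of users on the new
    code path. Hashing makes the assignment stable across requests,
    so the same user always sees the same version."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < percent

# Ramp up gradually: percent=1, then 10, then 50, then 100.
# If error rates rise at any step, roll back by lowering percent.
```

Because the bucket is derived from the user ID rather than chosen at random per request, rolling back only requires changing one number.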
Monitoring
Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry. (Once a rocket has left the ground, telemetry is essential for tracking what is happening, and for understanding failures.)
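A minimal sketch of one such metric, an error rate over a sliding window of recent requests (the class name and alerting threshold are assumptions for illustration):

```python
from collections import deque

class ErrorRateMonitor:
    """Track the error rate over the last `window` requests and
    signal when it crosses a threshold."""

    def __init__(self, window=100, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.threshold = threshold

    def record(self, ok):
        self.outcomes.append(ok)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def alarming(self):
        return self.error_rate() > self.threshold
```

Real telemetry systems (Prometheus, Datadog, etc.) do the same bookkeeping at much larger scale, but the core signal is this simple.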
Management & Training
Implement good management practices and training
Hard disks crash
Preventions
Add redundancy to individual hardware components.
Tolerating the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy.
:red_flag: Hard disks have a mean time to failure (MTTF) of about 10 to 50 years
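The software fault-tolerance strategy above, tolerating the loss of whole machines, can be sketched as a failover read across replicas (callables stand in for networked replicas; all names here are illustrative):

```python
def read_with_failover(replicas, key):
    """Try each replica in turn; tolerate individual machine failures
    as long as at least one replica responds. `replicas` are callables
    standing in for reads against separate machines."""
    last_error = None
    for replica in replicas:
        try:
            return replica(key)
        except ConnectionError as err:
            last_error = err  # this replica is down; try the next one
    raise last_error  # every replica failed

def dead_replica(key):      # simulates a crashed machine
    raise ConnectionError("replica is down")

def healthy_replica(key):   # simulates a working machine
    return {"key": key, "value": 42}

print(read_with_failover([dead_replica, healthy_replica], "user:1"))
```

Unlike hardware redundancy (RAID, dual power supplies), this works across entire machines and allows rolling restarts with no downtime.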
Software error
E.g. the Linux leap-second bug of June 30, 2012
:red_cross:Pitfalls
:red_cross: Don't sacrifice reliability in order to reduce development or operational costs.
E.g. when developing a prototype product for an unproven market
E.g. for a service with a very narrow profit margin
Backups are always necessary
Multiple hard disks
Multiple clouds
Chapter 2: Data Models and Query Languages
Common Data Models
SQL
Relational
Uses schema-on-write: the schema is explicit, and the database ensures all written data conforms to it
Uses declarative query languages like SQL
Row ordering within a relation is not guaranteed
Direct index access
MySQL is not a good choice for a large database whose schema changes often
Because a schema change (ALTER TABLE) copies the entire table, which can cause long downtime, from minutes to hours on a large table.
NoSQL
Document
Need to traverse from one access path
Graph
Triple-stores
RDF (Resource Description Framework) model
Semantic web
Pros
Can directly access a vertex by its unique ID, or use an index to find the vertices with a particular value
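That direct-ID-or-index access pattern can be sketched with plain Python dicts (the data and the index structure are illustrative, not any particular graph database's API):

```python
# Vertices keyed by unique ID; edges as adjacency lists.
vertices = {
    1: {"name": "Alice", "city": "London"},
    2: {"name": "Bob", "city": "London"},
    3: {"name": "Carol", "city": "Paris"},
}
edges = {1: [2], 2: [3], 3: []}  # e.g. "follows" relationships

# Build a secondary index so vertices can also be found by a
# property value instead of their ID.
city_index = {}
for vid, props in vertices.items():
    city_index.setdefault(props["city"], []).append(vid)

print(vertices[2]["name"])    # direct access by unique ID -> Bob
print(city_index["London"])   # index lookup by value -> [1, 2]
```

A real graph store layers query planning and traversal on top, but both entry points, unique ID and value index, reduce to lookups like these.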
Not Covered in This Book
Full Text Search
Query Languages
Declarative
Relational
SQL
Graph
Cypher (Neo4j)
SPARQL (for triple-stores using the RDF model)
Datalog (Cascalog is a Datalog implementation for querying large datasets in Hadoop)
Document
Aggregation pipeline (MongoDB; also MapReduce-style querying)
Pro
Faster to implement and less error-prone
Query optimizer: write the query once and the database chooses the execution plan
Vendor updates and optimizations won't break your code
Imperative (programming language style)
Definition: you write down the detailed steps needed to accomplish the goal
Pro
Detailed, low-level, customized control
Con
Hard to optimize
Maintenance overhead: any syntax or API change must be updated in every place it is used
Error-prone
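The declarative-vs-imperative contrast above, sketched in Python (modeled on the kind of sharks-from-animals example the book uses; the function names are illustrative):

```python
animals = [
    {"name": "shark", "family": "Sharks"},
    {"name": "salmon", "family": "Salmonidae"},
]

# Imperative: spell out *how* to iterate and accumulate. Any change
# to the data layout means editing every such loop.
def get_sharks_imperative(animals):
    sharks = []
    for animal in animals:
        if animal["family"] == "Sharks":
            sharks.append(animal)
    return sharks

# Declarative style: state *what* you want; the runtime (or, in a
# database, the query optimizer) decides how to execute it.
def get_sharks_declarative(animals):
    return [a for a in animals if a["family"] == "Sharks"]
```

Both return the same result, but only the declarative form leaves the execution strategy free to change without touching the query.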
:red_flag: Tips for selecting a data model and query language
Query languages
Prefer the declarative, more abstract language: fewer bugs when things change, and better-optimized performance in the long term.
Data models
Base the choice on the kinds of relationships in the application
Few relationships, and objects are mostly self-contained trees (one-to-many)
Document data model (e.g. MongoDB)
Very strong and complex relationships between any kinds of vertices (many-to-many)
Graph data models are perfect
Relationships are mostly fixed and predictable, without overly complicated connection logic (many-to-many)
Relational is perfect
Questions
Technical
Consistency
Reliability
Design questions
When I design an application, what language should I choose in different situations?
Why are microservices more efficient? What are the pros and cons?
How to guarantee the performance of Microservices?
How to migrate to Microservices?
What are the costs and benefits of distributed databases?
How do I actually implement operability, maintainability, and evolvability in my applications?
How to solve the problem of delay, stateless, and local testing in serverless application architecture?
Data model
How to pick an efficient model
What are the pros and cons of different data models
Relational
Document
Graph
NoSQL (Not Only SQL)
Pros
Open source vs. commercial SQL
More dynamic and expressive data models
Scales more easily to very large datasets and very high write throughput
Specialized query operations
SQL
Cons
Impedance mismatch: boilerplate code is required to translate rows and columns into objects for OOP
Structural
Scalability
Efficiency
Helpfulness
You want to learn how to make data systems scalable, for example, to support web or mobile apps with millions of users.
You need to make applications highly available (minimizing downtime) and operationally robust.
You are looking for ways of making systems easier to maintain in the long run, even as they grow and as requirements and technologies change.
You have a natural curiosity for the way things work and want to know what goes on inside major websites and online services. This book breaks down the internals of various databases and data processing systems, and it’s great fun to explore the bright thinking that went into their design.
Definitions
Data-intensive Applications
Data is the primary challenge: the quantity of data, the complexity of data, or the speed at which it is changing.
Compute-intensive Applications
CPU cycles are the bottleneck.
Reliability
Tolerating hardware & software faults (Human error)
Scalability
Measuring load & performance (Latency percentiles, throughput)
Maintainability
Operability, simplicity & evolvability
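The latency percentiles mentioned under Scalability can be computed with the nearest-rank method; a small Python sketch (this is one of several common percentile definitions, so other tools may give slightly different values):

```python
def percentile(latencies, p):
    """Return the p-th percentile using the nearest-rank method:
    the smallest observed value such that at least p% of
    observations are less than or equal to it."""
    ordered = sorted(latencies)
    # ceil(len * p / 100) without importing math; at least rank 1.
    rank = max(1, -(-len(ordered) * p // 100))
    return ordered[rank - 1]

# Illustrative request latencies in milliseconds; note how a few
# slow outliers dominate the tail percentiles.
latencies_ms = [12, 15, 20, 22, 25, 30, 35, 40, 200, 950]
print(percentile(latencies_ms, 50))  # median
print(percentile(latencies_ms, 95))  # tail latency
```

This is why the median looks healthy while p95/p99 reveal the slow requests that real users actually feel.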
:red_flag: Fault is not equal to Failure
A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.
Fault-tolerant or fault-resilient
Systems that anticipate faults and can cope with them. (not all faults)