Designing Data-Intensive Applications by Martin Kleppmann - Coggle Diagram
Designing Data-Intensive Applications
by Martin Kleppmann
Textbook Chapters
Chapter 1: Reliable, Scalable, and Maintainable Applications
Classify my projects
Project1: Fancy VR Mall (Importing to China)
Data-intensive: it doesn't need super-fast CPU/ML computation for most features, but the huge and complex VR assets and user-usage data are the bottleneck.
Project2: Trading
Data-intensive: because users will contribute data here
Compute-intensive: because computing speed and GPU/CPU cycles are crucial for ML decision-making and online data mining
Project3: E-commerce + Intl. Business (Offshore)
Data-intensive: mostly needs to retrieve new data and information from the world; fast response is not necessary.
Strategies to increase the fault-tolerance (resilience) of systems
Increase the rate of faults by triggering them deliberately
E.g. Netflix's Chaos Monkey
Chaos Monkey randomly terminates virtual machine instances and containers that run inside of your production environment. Exposing engineers to failures more frequently incentivizes them to build resilient services.
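The idea can be sketched in a few lines of Python (a toy sketch: the real Chaos Monkey talks to cloud-provider APIs, while the instance IDs and the `terminate` callback here are illustrative assumptions):

```python
import random

def chaos_monkey(instances, terminate, probability=0.1):
    """Randomly terminate a fraction of running instances.

    `instances` is a list of instance IDs; `terminate` is a callback
    that actually kills an instance (e.g. via a cloud provider API).
    Returns the IDs that were terminated.
    """
    killed = []
    for instance_id in instances:
        if random.random() < probability:
            terminate(instance_id)
            killed.append(instance_id)
    return killed

# Hypothetical usage: "terminate" by printing, so nothing real is harmed.
victims = chaos_monkey(["i-001", "i-002", "i-003"],
                       terminate=print, probability=1.0)
```

Running this regularly in production is what forces services to survive instance loss.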
Some faults simply cannot be cured, e.g. data leakage; the only solution is to prevent future leakage.
Reliability
Types of Faults
Hardware Faults
RAM becomes faulty
The power grid has a blackout
Human errors
Someone unplugs the wrong network cable
Configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10-25% of outages.
:red_flag: Preventions
Provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
Well-designed abstractions, APIs, and interfaces
Test thoroughly at all levels
Unit tests
Whole-system integration tests
Manual tests
Automated testing
Minimize the impact in the case of a failure
Make it fast to roll back configuration changes (backups)
Roll out new code gradually (small amount at a time)
Provide tools to recompute data, in case the old computation turns out to be incorrect (a panic button: don't make users think about how to fix it, just give them the answer)
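The "roll out gradually" idea above is often implemented by deterministically bucketing users; a minimal Python sketch (the hashing scheme and function name are illustrative, not any specific tool's API):

```python
import hashlib

def in_rollout(user_id, percent):
    """Deterministically put a fixed `percent` of users on the new
    code path. Hashing makes the assignment stable across requests,
    so the same user always sees the same version."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < percent

# Ramp up gradually: percent=1, then 10, then 50, then 100.
# If error rates rise at any step, roll back by lowering percent.
```

Because the bucket is derived from the user ID rather than chosen at random per request, rolling back only requires changing one number.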
Monitoring
Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry. (Once a rocket has left the ground, telemetry is essential for tracking what is happening, and for understanding failures.)
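A minimal sketch of one such metric, an error rate over a sliding window of recent requests (the class name and alerting threshold are assumptions for illustration):

```python
from collections import deque

class ErrorRateMonitor:
    """Track the error rate over the last `window` requests and
    signal when it crosses a threshold."""

    def __init__(self, window=100, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.threshold = threshold

    def record(self, ok):
        self.outcomes.append(ok)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def alarming(self):
        return self.error_rate() > self.threshold
```

Real telemetry systems (Prometheus, Datadog, etc.) do the same bookkeeping at much larger scale, but the core signal is this simple.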
Management & Training
Implement good management practices and training
Hard disks crash
Preventions
Add redundancy to individual hardware components.
Tolerating the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy.
:red_flag: Hard disks have a mean time to failure (MTTF) of about 10 to 50 years
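The software fault-tolerance strategy above, tolerating the loss of whole machines, can be sketched as a failover read across replicas (callables stand in for networked replicas; all names here are illustrative):

```python
def read_with_failover(replicas, key):
    """Try each replica in turn; tolerate individual machine failures
    as long as at least one replica responds. `replicas` are callables
    standing in for reads against separate machines."""
    last_error = None
    for replica in replicas:
        try:
            return replica(key)
        except ConnectionError as err:
            last_error = err  # this replica is down; try the next one
    raise last_error  # every replica failed

def dead_replica(key):      # simulates a crashed machine
    raise ConnectionError("replica is down")

def healthy_replica(key):   # simulates a working machine
    return {"key": key, "value": 42}

print(read_with_failover([dead_replica, healthy_replica], "user:1"))
```

Unlike hardware redundancy (RAID, dual power supplies), this works across entire machines and allows rolling restarts with no downtime.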
Software error
E.g. the Linux leap-second bug of June 30, 2012
:red_cross:Pitfalls
:red_cross: Don't sacrifice reliability in order to reduce development or operational costs.
E.g. when developing a prototype product for an unproven market
E.g. for a service with a very narrow profit margin
Backups are always necessary
Multiple hard disks
Multiple clouds
Chapter 2: Data Models and Query Languages
Common Data Models
SQL
Relational
Uses schema-on-write: the schema is explicit, and the database ensures all written data conforms to it
Uses declarative query languages like SQL
Row ordering within a relation is not guaranteed
Direct index access
MySQL is not a good choice for a large database whose schema changes often
Because a schema change (ALTER TABLE) copies the entire table, which can cause long downtime, from minutes to hours on a large table.
NoSQL
Document
Need to traverse from one access path
Graph
Triple-stores
RDF (Resource Description Framework) model
Semantic web
Pros
Can directly access a vertex by its unique ID, or use an index to find the vertices with a particular value
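That direct-ID-or-index access pattern can be sketched with plain Python dicts (the data and the index structure are illustrative, not any particular graph database's API):

```python
# Vertices keyed by unique ID; edges as adjacency lists.
vertices = {
    1: {"name": "Alice", "city": "London"},
    2: {"name": "Bob", "city": "London"},
    3: {"name": "Carol", "city": "Paris"},
}
edges = {1: [2], 2: [3], 3: []}  # e.g. "follows" relationships

# Build a secondary index so vertices can also be found by a
# property value instead of their ID.
city_index = {}
for vid, props in vertices.items():
    city_index.setdefault(props["city"], []).append(vid)

print(vertices[2]["name"])    # direct access by unique ID -> Bob
print(city_index["London"])   # index lookup by value -> [1, 2]
```

A real graph store layers query planning and traversal on top, but both entry points, unique ID and value index, reduce to lookups like these.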
Not Covered in This Book
Full Text Search
Query Languages
Declarative
Relational
SQL
Graph
Cypher (Neo4j)
SPARQL (for triple-stores using the RDF model)
Datalog (Cascalog is a Datalog implementation for querying large datasets in Hadoop)
Document
Aggregation pipeline (MongoDB; also MapReduce-style querying)
Pro
Faster to implement and less error-prone
Query optimizer: write the query once and the database chooses the execution plan
Vendor updates and optimizations won't break your code
Imperative (programming language style)
Definition: you write down the detailed steps needed to accomplish the goal
Pro
Detailed, low-level, customized control
Con
Hard to optimize
Maintenance overhead: any syntax or API change must be updated in every place it is used
Error-prone
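The declarative-vs-imperative contrast above, sketched in Python (modeled on the kind of sharks-from-animals example the book uses; the function names are illustrative):

```python
animals = [
    {"name": "shark", "family": "Sharks"},
    {"name": "salmon", "family": "Salmonidae"},
]

# Imperative: spell out *how* to iterate and accumulate. Any change
# to the data layout means editing every such loop.
def get_sharks_imperative(animals):
    sharks = []
    for animal in animals:
        if animal["family"] == "Sharks":
            sharks.append(animal)
    return sharks

# Declarative style: state *what* you want; the runtime (or, in a
# database, the query optimizer) decides how to execute it.
def get_sharks_declarative(animals):
    return [a for a in animals if a["family"] == "Sharks"]
```

Both return the same result, but only the declarative form leaves the execution strategy free to change without touching the query.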
:red_flag: Tips for selecting a data model and query language
Query languages
Prefer the declarative, more abstract language: fewer bugs when things change, and better-optimized performance in the long term.
Data models
Base the choice on the kinds of relationships in the application
Few relationships, and objects are mostly self-contained trees (one-to-many)
Document data model (e.g. MongoDB)
Very strong and complex relationships between any kinds of vertices (many-to-many)
Graph data models are perfect
Relationships are mostly fixed and predictable, without overly complicated connection logic (many-to-many)
Relational is perfect
Questions
Technical
Consistency
Reliability
Design questions
When I design an application, what language should I choose in different situations?
Why are microservices more efficient? What are the pros and cons?
How to guarantee the performance of Microservices?
How to migrate to Microservices?
What are the costs and benefits of distributed databases?
How do I actually implement operability, maintainability, and evolvability in my applications?
How to solve the problem of delay, stateless, and local testing in serverless application architecture?
Data model
How to pick an efficient model
What are the pros and cons of different data models
Relational
Document
Graph
NoSQL (Not Only SQL)
Pros
Open source vs. commercial SQL
More dynamic and expressive data models
Scales more easily to very large datasets and very high write throughput
Specialized query operations
SQL
Cons
Impedance mismatch: boilerplate code is required to translate rows and columns into objects for OOP
Structural
Scalability
Efficiency
Helpfulness
You want to learn how to make data systems scalable, for example, to support web or mobile apps with millions of users.
You need to make applications highly available (minimizing downtime) and operationally robust.
You are looking for ways of making systems easier to maintain in the long run, even as they grow and as requirements and technologies change.
You have a natural curiosity for the way things work and want to know what goes on inside major websites and online services. This book breaks down the internals of various databases and data processing systems, and it’s great fun to explore the bright thinking that went into their design.
Definitions
Data-intensive Applications
Data is the primary challenge: the quantity of data, the complexity of data, or the speed at which it is changing.
Compute-intensive Applications
CPU cycles are the bottleneck.
Reliability
Tolerating hardware & software faults (Human error)
Scalability
Measuring load & performance (Latency percentiles, throughput)
Maintainability
Operability, simplicity & evolvability
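The latency percentiles mentioned under Scalability can be computed with the nearest-rank method; a small Python sketch (this is one of several common percentile definitions, so other tools may give slightly different values):

```python
def percentile(latencies, p):
    """Return the p-th percentile using the nearest-rank method:
    the smallest observed value such that at least p% of
    observations are less than or equal to it."""
    ordered = sorted(latencies)
    # ceil(len * p / 100) without importing math; at least rank 1.
    rank = max(1, -(-len(ordered) * p // 100))
    return ordered[rank - 1]

# Illustrative request latencies in milliseconds; note how a few
# slow outliers dominate the tail percentiles.
latencies_ms = [12, 15, 20, 22, 25, 30, 35, 40, 200, 950]
print(percentile(latencies_ms, 50))  # median
print(percentile(latencies_ms, 95))  # tail latency
```

This is why the median looks healthy while p95/p99 reveal the slow requests that real users actually feel.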
:red_flag: Fault is not equal to Failure
A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.
Fault-tolerant or fault-resilient
Systems that anticipate faults and can cope with them. (not all faults)