Data Intensive Applications
Design Goals
Reliability: should work correctly at desired level of performance in face of adversity
Software Faults
Human errors
Hardware Faults
Scalability: as system grows in data/traffic volume and complexity, there should be reasonable ways of dealing with that growth
Load: described with numbers called load parameters
number of active users
hit rate on a cache
ratio of reads to writes to db
requests per second to web server
Performance
depends on load parameters
Increase load, keep system resources constant, study performance
Increase load, see how much system resources have to be increased to maintain performance
Performance Numbers
Response Time - what the client sees: time between sending a request and receiving the response (service time plus network and queueing delays)
Latency - time a request spends waiting to be handled
Use Percentiles rather than averages to study performance
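To illustrate why percentiles say more than averages, here is a minimal sketch using the nearest-rank method. The sample response times and the `percentile` helper are made up for illustration; real monitoring systems typically use approximation algorithms over a rolling window.

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest sample such that
    at least p percent of all samples are <= it."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

# Hypothetical response times in milliseconds; one request was very slow.
response_times_ms = sorted([12, 15, 14, 13, 16, 15, 14, 13, 12, 950])

mean_ms = sum(response_times_ms) / len(response_times_ms)
p50 = percentile(response_times_ms, 50)  # median: the typical request
p99 = percentile(response_times_ms, 99)  # tail: the slowest requests

print(f"mean={mean_ms:.1f} ms, p50={p50} ms, p99={p99} ms")
```

The single outlier pulls the mean to about 107 ms, a value no user actually experienced, while the median stays at 14 ms and the high percentile exposes the slow request.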
Coping with load
Scaling up: vertical scaling, moving to a more powerful machine
Scaling out: horizontal scaling, distributing load across multiple smaller machines
mechanism
elastic: automatically add computing resources
manual: human detects and adds more resources
Maintainability: different people working on application should be able to work productively
design principles
Simplicity: make it easy to understand for new engineers by removing complexity
Evolvability: make it easy to change in the future
Operability: make it easy for ops team to run app smoothly
Data Models & Query Languages
Document
Schema on Read
Works well for tree type data with one to many relationships or no relationships
Not suited for many to many joins. Data has to be denormalized (duplicated) or joins have to be done in application
Every record has a single parent
Data locality - an advantage since all data related to a document is stored in one place as opposed to multiple tables
Query Language
MapReduce - programming model for processing large amounts of data in bulk across many machines
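The MapReduce model can be sketched in a single Python process as three steps: map each document to key-value pairs, group the pairs by key, and reduce each group to one value. The `observations` data and all function names below are hypothetical; a real system would run the map and reduce steps on many machines and would not hold the groups in memory like this.

```python
from collections import defaultdict

def map_phase(documents, mapper):
    """Call the mapper once per document and collect the emitted pairs."""
    for doc in documents:
        yield from mapper(doc)

def shuffle(pairs):
    """Group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Call the reducer once per key with all values for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical documents: animal sightings recorded per month.
observations = [
    {"family": "Sharks", "month": "2024-01", "num_animals": 3},
    {"family": "Sharks", "month": "2024-01", "num_animals": 4},
    {"family": "Whales", "month": "2024-01", "num_animals": 2},
    {"family": "Sharks", "month": "2024-02", "num_animals": 1},
]

def mapper(doc):
    # Emit (month, count) only for shark sightings.
    if doc["family"] == "Sharks":
        yield doc["month"], doc["num_animals"]

def reducer(key, values):
    # Total number of animals seen in that month.
    return sum(values)

sharks_per_month = reduce_phase(shuffle(map_phase(observations, mapper)), reducer)
print(sharks_per_month)
```

The mapper and reducer are pure functions of their inputs, which is what lets a framework rerun them on any machine and in any order.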
Graph
Suitable when there are complicated many to many relationships
A graph consists of two kinds of objects: vertices (also known as nodes or entities) and edges (also known as relationships or arcs)
Data models
Property Graph Model
Each vertex consists of: A unique identifier. A set of outgoing edges. A set of incoming edges. A collection of properties (key-value pairs)
Each edge consists of: A unique identifier. The vertex at which the edge starts (the tail vertex). The vertex at which the edge ends (the head vertex). A label to describe the kind of relationship between the two vertices. A collection of properties (key-value pairs)
Query Language
Cypher
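The vertex and edge structure of the property graph model can be sketched as plain in-memory Python. This is not Cypher and not how any particular graph database stores data; the class names, the `neighbors` helper, and the sample vertices are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    edge_id: int
    tail: int    # id of the vertex where the edge starts
    head: int    # id of the vertex where the edge ends
    label: str   # kind of relationship
    properties: dict = field(default_factory=dict)

@dataclass
class Vertex:
    vertex_id: int
    properties: dict = field(default_factory=dict)
    outgoing: list = field(default_factory=list)  # ids of outgoing edges
    incoming: list = field(default_factory=list)  # ids of incoming edges

class PropertyGraph:
    def __init__(self):
        self.vertices = {}
        self.edges = {}

    def add_vertex(self, vertex_id, **properties):
        self.vertices[vertex_id] = Vertex(vertex_id, dict(properties))

    def add_edge(self, edge_id, tail, head, label, **properties):
        edge = Edge(edge_id, tail, head, label, dict(properties))
        self.edges[edge_id] = edge
        # Register the edge on both endpoints so it can be walked either way.
        self.vertices[tail].outgoing.append(edge_id)
        self.vertices[head].incoming.append(edge_id)

    def neighbors(self, vertex_id, label):
        """Ids of vertices reached by outgoing edges with the given label."""
        return [
            self.edges[e].head
            for e in self.vertices[vertex_id].outgoing
            if self.edges[e].label == label
        ]

graph = PropertyGraph()
graph.add_vertex(1, name="Lucy", type="person")
graph.add_vertex(2, name="Idaho", type="state")
graph.add_vertex(3, name="United States", type="country")
graph.add_edge(10, tail=1, head=2, label="born_in")
graph.add_edge(11, tail=2, head=3, label="within")

birthplaces = graph.neighbors(1, "born_in")
print([graph.vertices[v].properties["name"] for v in birthplaces])
```

Keeping both the outgoing and incoming edge lists on each vertex is what makes traversal efficient in both directions.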
Triple Stores Model
In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object)
For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object.
Subject: equivalent to a vertex in a property graph
Object
Primitive datatype
Or another vertex in the graph
Query Language
SPARQL
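A triple store and its pattern-style lookups can be sketched with a list of Python tuples. This is not SPARQL; the `match` helper, which treats `None` as a wildcard, and every triple other than (Jim, likes, bananas) are hypothetical.

```python
# Each triple is (subject, predicate, object). The object is either a
# primitive value (a property of the subject) or the name of another
# vertex (an edge in the graph).
triples = [
    ("Jim", "likes", "bananas"),
    ("Jim", "age", 33),             # object is a primitive value
    ("Jim", "friend_of", "Lucy"),   # object is another vertex
    ("Lucy", "likes", "apples"),
]

def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None matches anything."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

everything_about_jim = match(triples, subject="Jim")
who_likes_what = match(triples, predicate="likes")
print(who_likes_what)
```

Because properties and relationships share the same three-part shape, one lookup function covers both.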
Relational
Data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL)
Well suited for many to many relationships, allows for normalization. DB does joins
Query optimizers can use indexes
Schema on Write
Many relational databases now support storing XML/JSON data, so they can emulate a document model
Multiple tables can lead to cumbersome schema
Query Language
SQL - declarative language
Lends itself to parallel processing
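A minimal sketch of a normalized many to many relationship and a declarative query, using Python's built-in `sqlite3` module with an in-memory database. The table names and rows are hypothetical. The query states which rows are wanted and leaves the join strategy to the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE skills (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Join table: one row per (person, skill) pair.
    CREATE TABLE people_skills (
        person_id INTEGER REFERENCES people(id),
        skill_id  INTEGER REFERENCES skills(id),
        PRIMARY KEY (person_id, skill_id)
    );
    INSERT INTO people VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO skills VALUES (1, 'SQL'), (2, 'C');
    INSERT INTO people_skills VALUES (1, 1), (2, 1), (2, 2), (3, 2);
""")

# Declarative: no loops and no access path, only the desired result.
rows = conn.execute(
    """
    SELECT p.name
    FROM people AS p
    JOIN people_skills AS ps ON ps.person_id = p.id
    JOIN skills AS s ON s.id = ps.skill_id
    WHERE s.name = ?
    ORDER BY p.name
    """,
    ("SQL",),
).fetchall()

print(rows)
conn.close()
```

Each skill name is stored once in `skills`, so renaming it is a single-row update, and the database rather than the application performs the join.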
Network
CODASYL model
Record can have multiple parents
Supports one to many and many to many models