Please enable JavaScript.
Coggle requires JavaScript to display documents.
Data Engineering - Coggle Diagram
Data Engineering
Transformation
Types of queries
Data Definition lannguage
CREATE database, DROP table, UPDATE table...
Data manipulation language
SELECT, INSERT, UPDATE, DELETE, COPY, MERGE
Data Control Language
GRANT, DENY, REVOKE
Transaction Control Language
COMMIT, ROLLBACK
-
-
-
Data modeling
-
-
-
Normalization
- Denormalized
- First normal form
- Second normal form
- Third normal form
Models
Inmon
All the data related s inside the same database. Than data can be splitted into data marts. Multiple sources but data ends up in the same place, highly normalized.
Usually this in OLAP
KimballRelated to data marts. This could lead to redundancy.
- Fact table: Insert only. Only registers events happened
- Dimensions table: they provide the reference data, attributes and context to events in fact table
What happens when we want to update a dimension?Slowly changing dimensions
- Type 1: overwrite existing records
- Type 2: use a column to flag when a record has been changed. It adds a new row
- Type 3: similar to Type 2 but ads a new column
-
-
-
-
-
Ingestion
Source systems
OLTP
Rows based
ACID
- Atomicity: several changes done as a unit
- Consistency: any query to the database will return the most recent data
- Isolation: in case 2 changes to the same resource arrive, the system will start with the first, and then the second
- Durability: committed data will never be lost, also in case of power failure
-
Relational databases
Data is stored in a table of relations (rows) and each relation contains multiple fields (columns)
-
API
REST
Dominant API paradigm, stands for Representational State Transfer
-
Webhooks
Also called Reverse API. When something specific happens in the source system, it triggers an HTTP request to a predefined URL in the data consumer's system.
Example:
The source system is a website. When a user clicks a button, the website sends an HTTP POST request to a monitoring system (owned by the website owner). The monitoring system receives this request and processes it, such as incrementing a counter by +1
RPC and gRPC
Remote Proceure Call, used in distrubuted computing
-
-
Database Logs
Each change requested to the database is forst written to a logs table. In this way we can recover also in case of power loss
Change Data capture
This method tracks each change that happens in a database. Indeed it can be used to create a database replica
Characteristics
-
-
-
Asynchronous
Systems are not tightly coupled. Thus if an event cannot be processed, this doesn't block the whole pipeline. And also parallelization is possible
Payload
- kind
- shape
- size
- schema and data types
- metadata
-
Batch ingestion patterns
-
-
File based export
In case of a data transfer between databases, instead of creating a database connection (that is risky) we first export source database data into files and then we upload the files in the target database
-
-
-
-
Data architecture
-
-
Tight coupled system
Multitier
Service built on more than 1 layer, but upper layers are dependent to lower layer (ex: if data laer fails, than also presentation layer fails)
Monoliths
All code is ran in the samae machine, so if this machine fails, everything fail
-
-
-
Storage
Data storage system
Distributed storage
Data is stored in multiple servers, thus the system has to provide redundancy in case a server fails
Eventual consistency
- Basically available: database reads and wirtes are made on best-effort basis, this means that data is available most of the time
- soft-state: it's uncertain if the transaction has been committed or not
- Eventual consistency: at some point, reading data will return consistent values
Strong consistency
The sustem first distribute the changes to every node, and then reads the data
-
-
Object storage
Like Amazon S3. Every time we write something, we create a new object. This means that also versioning is storage-intensive
Hadoop Distributed File System
Similar to object storage but Hadoop combines storage and computing power in the same node
Data CatalogsCentralized metadata store for all data across the organization (data lineage, data relationships, data descriptions).
- Automated scanning: to infer the metadata
- Data portal and social layer: web interfaces to discover data and read table descriptions
Schema
- Schema on write: a table has a schema and we must be compliant when writing data
- Schema on read: the schema is dinamically created when data it is written
-