Data Lake, Warehouse, Mart
Data Lake
Data Warehouse
Data Mart
Definition
a system used for reporting and data analysis, considered a core component of business intelligence
a central repository of integrated, structured data from one or more disparate sources
stores current and historical data
approaches used to build a data warehouse system
ELT
maintains a staging area inside the data warehouse itself
data is extracted from the source systems
-> loaded directly into the data warehouse
-> before any transformation occurs
ETL
Step 1: extract data from the source -> store it in the staging layer or staging database
Step 2: transform the data -> store it in an operational data store (ODS) database
Step 3: move the data into the data warehouse database
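The difference between the two approaches is where the transform step runs. The sketch below is one way to picture it, assuming an in-memory SQLite database stands in for the warehouse. The sample rows, table names (`stg_orders`, `orders`), and function names are hypothetical and only for illustration.

```python
import sqlite3

# Hypothetical source rows: (order_id, amount as text, country code)
SOURCE_ROWS = [(1, " 10.50", "vn"), (2, "3.00 ", "us"), (3, "7.25", "VN")]


def clean(row):
    """Transform step: parse the amount and normalize the country code."""
    order_id, amount, country = row
    return (order_id, float(amount.strip()), country.strip().upper())


def run_etl(rows):
    """ETL: extract -> stage -> transform outside the warehouse -> load."""
    staging = list(rows)                        # Step 1: staging layer
    transformed = [clean(r) for r in staging]   # Step 2: transformed rows (stands in for an ODS)
    dw = sqlite3.connect(":memory:")            # Step 3: load into the warehouse
    dw.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, country TEXT)")
    dw.executemany("INSERT INTO orders VALUES (?, ?, ?)", transformed)
    return dw


def run_elt(rows):
    """ELT: load raw rows into a staging table inside the warehouse, then transform with SQL."""
    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE stg_orders (order_id INTEGER, amount TEXT, country TEXT)")
    dw.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)
    dw.execute(
        "CREATE TABLE orders AS "
        "SELECT order_id, CAST(TRIM(amount) AS REAL) AS amount, "
        "UPPER(TRIM(country)) AS country FROM stg_orders"
    )
    return dw


def totals_by_country(dw):
    """Query the final warehouse table, whichever way it was built."""
    query = "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
    return dw.execute(query).fetchall()
```

Both functions end with the same `orders` table. In `run_elt` the raw rows stay available in `stg_orders` inside the warehouse, while in `run_etl` only the cleaned rows ever reach it.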
Characteristics
1, Subject-oriented
2, Time-variant
3, Non-volatile
4, Integrated
Architecture
1, Source Data
2, Data Staging (ETL)
3, Data Warehouse
4, Data Marts
External data
Internal data
Operational system data
Flat file
Metadata
raw table
Summary data
Definition
A data mart is a subset of a data warehouse focused on a particular line of business, department, or subject area
Ex: a specific department in the business, such as finance, sales, or marketing
Benefits
1, More cost-efficient than a DBW
2, Simplified data access
holds a small subset of data, so users can retrieve what they need quickly and with less work
3, Quicker access to insights
4, Simpler data maintenance
5, Easier and faster implementation
schema
star
data vault
snowflake
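As an illustration of the star schema listed above, here is a minimal sketch, assuming SQLite as the engine. One fact table holds the measures, and each dimension table is joined to it directly. All table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, one row per product / date
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: numeric measures plus a foreign key to each dimension
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    quantity INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 20240101, 2, 2000.0),
    (2, 20240101, 1, 150.0),
    (1, 20240201, 1, 1000.0);
""")


def revenue_by_category(connection):
    """Typical star-schema query: join the fact table to one dimension and aggregate."""
    query = """
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """
    return connection.execute(query).fetchall()
```

A snowflake variant of the same model would normalize the dimensions further, for example by moving `category` out of `dim_product` into its own table, at the cost of one more join per query.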
types of data mart
dependent data mart
depends on the DBW
independent data mart
does not rely on the DBW
hybrid data mart
combines data from
DBW
other operational sources
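A dependent data mart can be pictured as a filtered slice of the warehouse. The sketch below assumes the simplest possible form, a SQL view over a warehouse table in SQLite. Real data marts are often separate databases, and the names `dw_transactions` and `mart_sales` are hypothetical.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
-- Warehouse table covering every department
CREATE TABLE dw_transactions (id INTEGER PRIMARY KEY, department TEXT, amount REAL);
INSERT INTO dw_transactions VALUES
    (1, 'sales', 120.0),
    (2, 'finance', 75.5),
    (3, 'sales', 30.0),
    (4, 'marketing', 12.0);

-- Dependent data mart: only the sales subset, sourced entirely from the warehouse
CREATE VIEW mart_sales AS
    SELECT id, amount FROM dw_transactions WHERE department = 'sales';
""")


def mart_rows(connection):
    """Read the sales mart; its content always follows the warehouse table."""
    return connection.execute("SELECT id, amount FROM mart_sales ORDER BY id").fetchall()
```

Because the mart is defined on top of the warehouse, a new sales row inserted into `dw_transactions` appears in `mart_sales` without a separate load. An independent data mart would instead be filled straight from the operational sources.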
Compare with DBW
compare to DBW
DBW: top-down <> DM: bottom-up
DBW: centralized data <> DM: decentralized data
functions
1, Data consolidation
2, data cleaning
3, data integration
4, data storage
5, Data transformation
6, Data analysis
7, data mining
8, Data reporting
9, Performance optimization
Definition
a centralized repository for both structured and unstructured data
Big Data 4 Vs
Volume
Variety
many sources, many formats
Veracity
data quality & availability
Velocity
the speed at which data is generated
components/ architecture
Data Ingestion
Data Processing
Data Storage
Data Exploration & Analytics
Data governance
enforcing policies
security
access control
tracking data lineage
maintaining data quality
data auditing
metadata management
data cataloging
1, data type
structured, semi-structured, unstructured <> structured
2, schema
not predefined <> predefined
3, format
from raw -> transformed <> processed
4, scalability
easy, low cost <> difficult, expensive
5, users
6, use case
ML, real-time analytics <> reporting, BI
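The schema row of the comparison (a data lake does not fix the schema up front, a warehouse does) is often described as schema-on-read versus schema-on-write. A minimal sketch of that contrast follows, assuming JSON lines as the raw lake format and SQLite as the warehouse. The event fields and function names are hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events as they might land in a data lake: one JSON document per line,
# and not every document carries the same fields.
RAW_EVENTS = "\n".join([
    '{"user_id": "a", "event_type": "click", "duration_ms": 12}',
    '{"user_id": "b", "event_type": "view"}',
    '{"user_id": "a", "event_type": "click", "duration_ms": 30, "device": "mobile"}',
])


def read_with_schema(raw, fields):
    """Schema-on-read: every document is kept; structure is applied only at query time."""
    rows = []
    for line in raw.splitlines():
        doc = json.loads(line)
        rows.append(tuple(doc.get(field) for field in fields))  # missing field -> None
    return rows


def load_into_warehouse(raw):
    """Schema-on-write: the table is declared first; rows that do not fit are rejected on load."""
    dw = sqlite3.connect(":memory:")
    dw.execute(
        "CREATE TABLE events ("
        "user_id TEXT NOT NULL, event_type TEXT NOT NULL, duration_ms INTEGER NOT NULL)"
    )
    rejected = 0
    for line in raw.splitlines():
        doc = json.loads(line)
        values = (doc.get("user_id"), doc.get("event_type"), doc.get("duration_ms"))
        try:
            dw.execute("INSERT INTO events VALUES (?, ?, ?)", values)
        except sqlite3.IntegrityError:
            rejected += 1  # the row violates the predefined schema
    loaded = dw.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    return loaded, rejected
```

With the sample events, the lake-style reader returns all three documents and fills gaps with `None`, while the warehouse-style loader accepts two rows and rejects the one without `duration_ms`.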