Data Lake, Warehouse, Mart

Data Lake

Data Warehouse

Data Mart

Define

A data warehouse is a system used for reporting and data analysis and is considered a core component of business intelligence

Data warehouses are central repositories of integrated, structured data from one or more disparate sources

They store both current and historical data

approaches used to build a data warehouse system

ELT

ETL

ELT maintains a staging area inside the data warehouse itself

In ELT, data is extracted from source systems -> loaded directly into the data warehouse, before any transformation occurs
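
A minimal sketch of the ELT flow, assuming an in-memory SQLite database as a stand-in for the warehouse. The table names (`stg_sales`, `sales`) and the sample rows are hypothetical. Raw rows land in a staging table inside the warehouse first, and the transformation runs afterwards using the warehouse's own SQL engine.

```python
import sqlite3

# Hypothetical source rows; in practice these would come from a source system.
source_rows = [
    ("2024-01-05", "north", "120.50"),
    ("2024-01-06", "south", "80.00"),
]

warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse

# Load: raw data lands unchanged in a staging table inside the warehouse.
warehouse.execute(
    "CREATE TABLE stg_sales (sale_date TEXT, region TEXT, amount TEXT)"
)
warehouse.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", source_rows)

# Transform: runs afterwards, inside the warehouse, with its SQL engine.
warehouse.execute(
    """
    CREATE TABLE sales AS
    SELECT sale_date,
           UPPER(region) AS region,
           CAST(amount AS REAL) AS amount
    FROM stg_sales
    """
)

elt_result = warehouse.execute(
    "SELECT region, amount FROM sales ORDER BY sale_date"
).fetchall()
print(elt_result)
```

The staging table keeps the untouched raw rows, so the transformation can be rerun or changed later without extracting from the source again.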

Step 1: extract data from the source -> store it in the staging layer (staging database)

Step 2: transform the data -> store it in an operational data store (ODS) database

Step 3: load the transformed data into the data warehouse database
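
The ETL sequence (extract to staging, transform into the ODS, load into the warehouse) can be sketched as follows. This is an illustration only: three in-memory SQLite databases stand in for the staging, ODS, and warehouse databases, and all table names and sample records are hypothetical.

```python
import sqlite3

# Hypothetical source records; names and values are made up for illustration.
source_records = [
    {"id": 1, "customer": " alice ", "amount": "19.90"},
    {"id": 2, "customer": "BOB", "amount": "5.00"},
]

staging = sqlite3.connect(":memory:")  # staging database
ods = sqlite3.connect(":memory:")      # operational data store
dwh = sqlite3.connect(":memory:")      # data warehouse

# Extract: copy source rows untouched into the staging database.
staging.execute("CREATE TABLE stg_orders (id INTEGER, customer TEXT, amount TEXT)")
staging.executemany(
    "INSERT INTO stg_orders VALUES (:id, :customer, :amount)", source_records
)

# Transform: clean the rows outside the warehouse and store them in the ODS.
ods.execute("CREATE TABLE ods_orders (id INTEGER, customer TEXT, amount REAL)")
for row_id, customer, amount in staging.execute(
    "SELECT id, customer, amount FROM stg_orders"
):
    ods.execute(
        "INSERT INTO ods_orders VALUES (?, ?, ?)",
        (row_id, customer.strip().title(), float(amount)),
    )

# Load: move the already-transformed rows into the warehouse.
dwh.execute("CREATE TABLE fact_orders (id INTEGER, customer TEXT, amount REAL)")
dwh.executemany(
    "INSERT INTO fact_orders VALUES (?, ?, ?)",
    ods.execute("SELECT id, customer, amount FROM ods_orders").fetchall(),
)

etl_result = dwh.execute(
    "SELECT id, customer, amount FROM fact_orders ORDER BY id"
).fetchall()
print(etl_result)
```

Unlike ELT, the warehouse here only ever receives cleaned rows; the raw values stay behind in the staging database.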

Characteristics

1, Subject-oriented

2, Time-variant

3, Non-volatile

4, Integrated
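
Two of these characteristics, time-variant and non-volatile, can be illustrated with an append-only history table. This is a rough sketch under assumptions: SQLite stands in for the warehouse, and the table `customer_history` and the helpers `record_change` and `city_as_of` are hypothetical names.

```python
import sqlite3

dwh = sqlite3.connect(":memory:")
dwh.execute(
    "CREATE TABLE customer_history (customer_id INTEGER, city TEXT, valid_from TEXT)"
)


def record_change(customer_id, city, valid_from):
    # Non-volatile: changes are appended; existing rows are never
    # updated or deleted.
    dwh.execute(
        "INSERT INTO customer_history VALUES (?, ?, ?)",
        (customer_id, city, valid_from),
    )


def city_as_of(customer_id, as_of_date):
    # Time-variant: every row carries a date, so a past state can be queried.
    row = dwh.execute(
        """
        SELECT city FROM customer_history
        WHERE customer_id = ? AND valid_from <= ?
        ORDER BY valid_from DESC
        LIMIT 1
        """,
        (customer_id, as_of_date),
    ).fetchone()
    return row[0] if row else None


record_change(1, "Lyon", "2023-01-01")
record_change(1, "Paris", "2024-06-01")

print(city_as_of(1, "2023-12-31"))
print(city_as_of(1, "2024-07-01"))
```

Dates are stored as ISO 8601 strings so that plain string comparison orders them correctly.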

Architecture

1, Source Data

2, Data Staging (ETL)

3, Data-warehouse

4, Data Marts

External data

internal data

Operational system data

Flat file

Metadata

raw table

Summary data

Define

A data mart is a subset of a data warehouse focused on a particular line of business, department, or subject area
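
A small sketch of the "subset of a data warehouse" idea, assuming SQLite and a hypothetical `transactions` table. Here the mart is modelled as a view that exposes only the sales subject area; real data marts are often separate tables or databases, so treat the view as a simplification.

```python
import sqlite3

dwh = sqlite3.connect(":memory:")  # stand-in for the data warehouse
dwh.execute("CREATE TABLE transactions (id INTEGER, department TEXT, amount REAL)")
dwh.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, "finance", 100.0),
        (2, "sales", 250.0),
        (3, "sales", 50.0),
        (4, "marketing", 75.0),
    ],
)

# The "mart" is a view exposing only one department's slice of the warehouse.
dwh.execute(
    """
    CREATE VIEW sales_mart AS
    SELECT id, amount FROM transactions WHERE department = 'sales'
    """
)

mart_rows = dwh.execute("SELECT id, amount FROM sales_mart ORDER BY id").fetchall()
mart_total = dwh.execute("SELECT SUM(amount) FROM sales_mart").fetchone()[0]
print(mart_rows, mart_total)
```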

Benefits

Ex: a specific department in the business, such as finance, sales, or marketing

1, More cost-efficient than a data warehouse (DBW)

2, Simplified data access

holds only a small subset of data, so users can quickly retrieve the data they need with less work

3, Quicker access to insights

4, Simpler data maintenance

5, Easier and faster implementation

schema

star

data vault

snowflake
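
As an illustration of the star schema, the sketch below builds one fact table surrounded by two dimension tables and answers a question with a single join. SQLite and all table, column, and product names are assumptions made for the example. In a snowflake schema, `category` would move out of `dim_product` into its own table referenced by it.

```python
import sqlite3

mart = sqlite3.connect(":memory:")
mart.executescript(
    """
    -- Dimension tables describe the "who / what / when".
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY, name TEXT, category TEXT
    );
    CREATE TABLE dim_date (
        date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER
    );
    -- The fact table holds the measures plus a key to each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        quantity INTEGER,
        revenue REAL
    );
    INSERT INTO dim_product VALUES (1, 'Laptop', 'Hardware'), (2, 'Editor', 'Software');
    INSERT INTO dim_date VALUES (10, 2024, 1), (11, 2024, 2);
    INSERT INTO fact_sales VALUES
        (1, 10, 2, 2000.0), (2, 10, 5, 250.0), (1, 11, 1, 1000.0);
    """
)

# One join from the fact table to a dimension answers the question.
revenue_by_category = mart.execute(
    """
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()
print(revenue_by_category)
```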

types of data mart

dependent data mart

dependent on DBW

independent data mart

does not rely on DBW

hybrid data mart

combines data from

DBW

other operational sources

Compare with DBW

DBW: top-down <> DM: bottom-up

DBW: centralized data <> DM: decentralized data

functions

1, Data consolidation

2, data cleaning

3, data integration

4, data storage

5, Data transformation

6, Data analysis

7, data mining

8, Data reporting

9, Performance optimization

define

A data lake is a centralized repository for both structured and unstructured data

Big Data 4 Vs

Volume

Variety

many sources, many formats

Veracity

data quality & availability

Velocity

the speed at which data is generated

components/ architecture

Data Ingestion

Data Processing

Data Storage

Data Exploration & Analytics

Data governance

enforcing policies

security

access control

tracking data lineage

maintaining data quality

data auditing

metadata management

data cataloging
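
A toy sketch of three of the governance items above: cataloging, access control, and lineage tracking. Everything here (`CatalogEntry`, `register`, `can_read`, `lineage`, the dataset and role names) is hypothetical and only illustrates the ideas; it is not how any particular governance tool works.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    owner: str
    derived_from: list = field(default_factory=list)  # lineage: upstream datasets
    allowed_roles: set = field(default_factory=set)   # access control


catalog = {}


def register(entry):
    # Data cataloging: every dataset is recorded with its metadata.
    catalog[entry.name] = entry


def can_read(role, dataset):
    # Access control: a role may read only datasets that list it.
    return role in catalog[dataset].allowed_roles


def lineage(dataset):
    # Lineage tracking: walk upstream datasets depth-first.
    upstream = []
    for parent in catalog[dataset].derived_from:
        upstream.append(parent)
        if parent in catalog:
            upstream.extend(lineage(parent))
    return upstream


register(CatalogEntry("raw_events", "ingest-team", [], {"engineer"}))
register(CatalogEntry("clean_events", "data-team", ["raw_events"], {"engineer", "analyst"}))
register(CatalogEntry("daily_report", "bi-team", ["clean_events"], {"analyst"}))

print(lineage("daily_report"))
print(can_read("analyst", "raw_events"))
```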

compare to DBW (Data Lake <> DBW)

1, data type

structured, semi-structured, unstructured <> structured

2, schema

not pre-defined (schema-on-read) <> pre-defined (schema-on-write)

3, format

raw to transformed <> processed

4, Scalability

easy, low cost <> difficult, expensive

5, users

6, use case

ML, real-time analytics <> reporting, BI
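
The schema difference in this comparison can be sketched as follows: a data lake keeps raw records and applies a schema only when they are read, while a data warehouse fixes the schema before anything is written. This is a simplified illustration; the list `lake` stands in for raw files in object storage, and the event fields and the helper `read_clicks` are hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events with inconsistent types and an extra field.
events = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": "7", "device": "mobile"}',
]

# Lake style: store the raw records as-is, whatever their shape.
lake = list(events)


def read_clicks(raw_records):
    # Schema-on-read: structure and types are imposed only at query time.
    return [
        (record["user"], int(record["clicks"]))
        for record in map(json.loads, raw_records)
    ]


# Warehouse style: the schema is defined first, and data must fit it on write.
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE clicks (user TEXT NOT NULL, clicks INTEGER NOT NULL)")
dwh.executemany("INSERT INTO clicks VALUES (?, ?)", read_clicks(lake))

lake_view = read_clicks(lake)
warehouse_rows = dwh.execute("SELECT user, clicks FROM clicks ORDER BY user").fetchall()
print(lake_view, warehouse_rows)
```

The extra `device` field survives in the lake and can be used by a later reader, while the warehouse table only holds the columns that were defined up front.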