Please enable JavaScript.
Coggle requires JavaScript to display documents.
DATA, ETL and DATA WAREHOUSE - Coggle Diagram
DATA, ETL and DATA WAREHOUSE
Data
Type of Data
Semi Structured Data
Definition
Examples
XML
HTML
JSON
Email
Web Pages
Difficult to manage
Unstructured Data
Definition
No field/attributes/Free from text - cannot be displayed in rows/colums
Example
Flat binary files containing Audio, video, and images
Est 80% of enterprise data (Gartner)
Advantages
Targeted for humans to process/digest
Requires more storage
Structured Data
Examples
Fields
Tables
Colums
RDBMS
Est. 20% of enterprise data (Gartner)
Spreadsheet
Definition
Have a predefined model, which organizes data into a form that is relatively easy to store, process, retrieve, and manage
Advantages
Targeted for computers to process
Requires less storage
Static Data
Dynamic Data
Definition
Data
Data means any representation of facts, number, measurements, graph, information in a form suitable for processing in a computer system
Information
Information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer
Data Sources
Data Warehouse vs
Data Lake
Data Warehouse
Definition
Repository for data collected and generated by business applications for a predetermined purpose
Nature of Data
Structured, processed, predefined schema
Purpose
Currently in use
Processing
Schema-on-write (SQL
Vendors
AWS, Cloudera, IBM, Google, Microsoft,
Oracle, Teradata, SAP, Snowflake
Process
ETL
Users
Business Analyst, Manager
Cost
Expensive for large data volumes
Speed
Very Fast
Data Lake
Definition
Vast repository that stores raw data in
its native format.
Nature of Data
Any data in raw/native format, no predefined schema
Purpose
Not yet determined
Processing
Schema-on-read (No SQL)
Process
ETL
Vendors
AWS, Google, Informatica, Microsoft, Teradata and
other data management providers
Users
Data scientists
Cost
Designed for low-cost storage
Speed
Slow
Data Pipeline
Governance & Monitoring
Collection - Data ingestion
Processing - Validation, cleaning, and transformation
Data sources - Batch and streaming data
Storage - Data warehouse
Consumption - Advanced analytics, machine learning
Definition
A set of actions that ingest raw data from disparate sources and move the data to a destination for storage and analysis. A pipeline also may include filtering and features that provide resiliency against failure.
Specifies the business
logic of your data management
Extraction, Transformation, and
Load (ETL)
Definition
Process where data is taken from the source system, configured and stored in a data warehouse or database
The most underestimated process in DW development
The most time-consuming process in DW development. 80% of development time is spent on ETL
ETL tools automated data integration tasks
Process
Data Loading
Definition
Data loading takes the prepared data, applies it to the data warehouse, and stores it in the database
Terminology
Initial Load
— populating all the data warehouse tables for the very first time
Incremental Load
— applying ongoing changes as necessary in a
periodic manner
Full Refresh
— completely erasing the contents of one or more
tables and reloading with fresh data
Tasks
Append, Destructive, Merge, Constructive Merge, Insert Operation
Places the cleaned data into the DBMS in its final, useable form
Compare data from source systems and the Data Warehouse
Document the load information for the users
Loading the data into the data warehouse presentation area
Data Transformation
Definition
Transforms the data in accordance with the business rules and standards that have been established
Basic Tasks
Selection
Splitting/joining
Conversion
Summarization
Enrichment
Transformation Tasks
Format Revisions
Decoding of Fields
Calculated and Derived Values
Scoring
Spelling Mistakes
Data Extraction
Definition
Data from different source systems is converted into one consolidated data warehouse format which is ready for transformation processing
Tasks
Extracting data from homogeneous or heterogeneous sources
Reading and understanding the source data
Copying the data needed for the data warehouse into the staging area for
further manipulation
Transfers them to the Data Warehouse server
Issues
Source Identification—identify source applications and source
structures
Method of extraction—for each data source, define whether the
extraction process is manual or tool-based.
Extraction frequency—for each data source, establish how frequently the
data extraction must by done—daily, weekly, quarterly, and so on.
Time window—for each data source, denote the time window for the extraction process.
Job sequencing
Exception handling—determine how to handle input records that cannot be
extracted
Data Integration Across Sources
Same data
different name
Different data
Same name
Data found here
nowhere else
Different keys
same data
Purpose
Reshape the relevant data from the source systems into
useful information to be stored in the data warehouse
Advantages
Helps companies to analyze their business data for taking critical business decisions
Able to answer complex business questions
A Data Warehouse provides a common data repository
Provides a method of moving the data from various sources into a data warehouse
Data Warehouse will update automatically once data source changes
A well-designed and documented ETL system is almost essential to the success of a Data Warehouse project
Allow verification of data transformation, aggregation and calculations rules
Allows sample data comparison between the source and the target system
ETL VS ELT
ETL
Process - Data is transformed at the staging server and then transferred to Datawarehouse DB
Code Usage - Used for Compute-intensive Transformations & Small amounts of data
Transformation - Transformation is done in the ETL server/staging area
Time-Load - Data is first loaded into staging and later loaded into the target system. Time intensive
Time-Transformation - The ETL process needs to wait for the transformation to complete. As data size grows, transformation time increase
Time-Maintenance - It needs highs maintenance as you need to select data to load and transform
Implementation Complexity - At an early stage, easier to implement
Support for Data warehouse - ETL model used for on-premises, relational and structured data
Data Lake Support - Does not support
Cost - High cost for small and medium businesses
Lookups - In the ETL process, both facts and dimensions need to be available in the staging area
Aggregations - Complexity increase with the additional amount of data in the dataset
Calculation - Overwrites existing column or Need to append the dataset and push to the target platform
Maturity - The process is used for over 2 decades. It is well documented and best practices easily available
Hardware - Most tools have unique hardware requirements that are expensive
Support for Unstructured Data - Mostly supports relational data
ELT
Process - Data remains in the DB of the Datawarehouse
Code Usage - Used for high amounts of data
Transformation - Transformations are performed in the target system
Time-Loaded - Data loaded into target system only once. Faster
Time-Transformation - In ELT process, speed is never dependant on the size of the data
Time-Maintenance - Low maintenance as data is always available
Implementation Complexity - To implement ELT process organization should have deep knowledge of tools and expert skills
Support for Data warehouse - Used in scalable cloud infrastructure which supports structured, unstructured data sources. Allows use of Data lake with unstructured data
Complexity - This process involves development from the output-backward and loading only relevant data
Cost - Low entry costs using online Software as a service platform
Lookups - All data will be available because extract and load occur in one single action
Aggregations - Power of the target platform can process significant amount of data quickly
Calculations - Easily add the calculated column to the existing table
Maturity - Relatively new concept and complex to implement
Hardware - Being Saas hardware cost is not an issues
Support for unstructured data - Support for unstructured data readily available
Online Analytic Processing (OLAP)
Definition
powerful technology for data discovery,
including capabilities for limitless report viewing, complex analytical calculations, and predictive “what if” scenario (budget, forecast) planning
In OLAP database, there is aggregated, historical data, stored in multi-dimensional
schemas (usually star schema)
Characteristics
Low volume of transactions
Queries are often very complex and involve aggregations
A response time is an effectiveness measure
The main operational structure in OLAP is based on a concept called
cube
A cube in OLAP is a multidimensional data structure (actual or virtual) that allows fast
analysis of data.
CUBE
A data cube allows data to be modeled and viewed in multiple
dimensions
Operation
Roll up (drill-up)
Drill down (roll down)
Slice Dice
Pivot (rotate)
Dimensional Model
Implementation
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
Specialized SQL servers (e.g., Redbricks)
Problem with ER Diagrams
Too complex
Not useful for data warehouse
Spider web of all entities and relationship to other entities
Definition
An approach to database design that provides an easy to understand and navigate
database
Purpose
Generate clear, effective reporting and analysis
Well-suited to the generation of exactly the kind of information
business people need to see
Schema
Definition
Facts, dimensions, and attributes can be organized in several ways, called schemas
The choice of schema depends on variables such as the type of reporting that the model needs to facilitate and the type of Business Intelligence tool being used
"Classic" Star Schema
A single fact table, with detailed and summary data
Fact table primary key has only one key column per dimension
Each key is generated
Each dimension is a single table. highly denormalized
Benefits
: Easy to understand, easy to defined hierarchies, reduces # of physical joins, low maintenance, very simple metadata
Drawbacks
: Summary data in the fact table yields poorer performance for summary levels, huge dimension tables a problem
The biggest drawback
: Dimension tables must carry a level indicator for every record and every query must use it. In the example below, without the level constraint, keys for all stores in the NORTH region, including aggregates for region and district will be pulled from the fact table, resulting in error
Snowflakes schema
A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
An extension of star schema where the diagram resembles a snowflakes in shape
Avoids redundancy of star schemas by normalizing dimension tables
Normalized tables optimize storage space, but decrease performance
Starflake schema: Combination of the star and snowflakes schemas, some dimensions normalize, other nor
The "Level" Problem
Level is a problem because it causes the potential for error. If the query builder, human or program, forgets about it, perfectly reasonable looking WRONG answer can occur
One alternative: the
FACT CONSTELLATION
model
Fact constellation schema
Multiple fact tables that share dimension tables
Viewed as a collection of stars, therefore called galaxy schema or fact constellation
DW Components
Data Staging
Extraction
- Gathers data from multiple heterogeneous data sources
May be operational databases or files in various formats
Internal or external
Uses APIs such as ODBC, JDBC for achieving interoperability
Transformation
Integration
Cleaning
Aggregation
Loading
Refreshing the data warehouse at a specified
frequency
Source Systems (Data sources)
Operational databases
Other internal or external sources of information (e.g. files)
OLAP
Provides multidimensional view from DWs and data
marts
ROLAP, MOLAP, or HOLAP
End User Data Access
OLAP tools
- Allow interactive exploration and manipulation of the
warehouse data
Data Mart
Dependent data mart
- A subset that is created directly from a data warehouse
Independent data mart
- A small data warehouse designed for a strategic business unit or a department
Data Warehouse
Enterprise data warehouse
- Centralized DW that encompasses all areas in an organization
Data mart
- Specialized DW targeted to a particular functional area or user group
Metadata repository
- Describes the content of the DW
Business metadata
Technical metadata
Metadata
Definition
Data about Data. It describe the contents and its acquisition and use to ease indexing and search
Include
Source System(s) of the Data, contact information
Related tables or subject areas
Programs or Processes which use the data
Population rules (Update or Insert and how often)
Status of the Data Warehouse’s processing and condition
Database
Database Background
Key
Definition
A key is one (or more) column(s) of a relation that is (are) used to
identify a row
Types
Secondary key
Foreign key
Candidate key
Primary key
Relationship
Types
Optionality
Cardinality
Many-to-many relationship (M:N)
One-to-many relationship (1:M)
One-to-one relationship (1:1)
Definition
Association between entities
Attribute
Example
Customer has name, street, city
Definition
Characteristic or property of an entity
Entity
Definition
A person, organization, object type, or concept about which information is stored.
Example
Person, place, object, event, or idea
Roles in the Database Environment
End Users (naive and sophisticated)
Business Analyst
Application Programmers
System Analysts
Data Administrator (DA)
Database Designers (Logical and Physical)
Database Administrator (DBA)
Database Concept
Data Manipulation Language (DML)
Provides basic data manipulation operations on data held in the database
General enquiry facility (query language) of the data
Data Definition Language (DDL)
Allows the DBA or user to describe and name entities, attributes, and relationships required for the application
All specifications are stored in the database
Permits specification of data types, structures and any data constraints.
Data Model
Integrated collection of concepts for describing data, relationships between data and constraints on the data in an organization
Structured Query Language (SQL)
Example
SELECT * FROM CUST WHERE CUST_ID = 1
Function SQL
Format
Insert, Delete, Modify data
Retrieve
Definition
International standard for creating, processing
and querying databases and their tables.
Entity-relationship (E-R) Diagram
Lines represent relationships between connected
entities
Rectangles represent entities
Visual way to represent a database
Impact of Databases and Database Technology
Personalized Applications:
based on smart mobile
devices
More recently:
Social Networks, Environmental and
Scientific Applications, Medicine and Genetics
Businesses:
Banking, Insurance, Retail, Transportation,
Healthcare, Manufacturing
Education:
Resources for content and Delivery
Service Industries:
Financial, Real-estate, Legal,
Electronic Commerce, Small businesses
Definition
Database
Information in an electronic form that can be
stored and used by a computer
Information, facts or numbers collected to be examined and considered and used to help decision-making
A collection of related data
Database System
The DBMS software together with the data itself.
Sometimes, the applications are also included
Database Management System (DBMS)
A software package or system to facilitate creation and maintenance of a computerized database
Historical Development of Database Technology
Big Data and NoSQL databases
Early Database Applications
Relational Model based Systems
Object Oriented and emerging applications
Data Warehouse and Data Mining
Data on the Web