Please enable JavaScript.
Coggle requires JavaScript to display documents.
DATA, DATA WAREHOUSE COMPONENT, DIMENSIONAL MODEL, DATABASE, SCHEMA,…
DATA
Data means any representation of facts, number, measurements, graph, or information
Type of Data
-
-
-
-
Structured Data
Definition
Predefined model, which organizes data into a form that is easy to store, process, retrieve, and manage
-
-
DATA WAREHOUSE COMPONENT
-
Metadata
Definition
- Data about Data. It describes the contents and its acquisition and use to ease indexing and search
Include
- System(s) of the Data, contact information
- Related tables or subject areas
- Programs or Processes which use the data
- Population rules (Update or Insert and how often)
- Status of the Data Warehouse’s processing and condition
-
OLAP
- Provides multidimensional view from DWs and data
marts
-
-
Data Mart
- Dependent data mart - A subset that is created directly from a data warehouse
- Independent data mart - A small data warehouse designed for a strategic business unit or a department
DIMENSIONAL MODEL
Problem with ER Diagrams
-
- Not useful for data warehouse
- Spider web of all entities and relationship to other entities
Purpose
- Generate clear, effective reporting and analysis
- Well-suited to the generation of exactly the kind of information
business people need to see
Definition
- An approach to database design that provides an easy to understand and navigate database
- Represented as a data cube or an hypercube
- DWs and OLAP use a multidimensional view of data
Implementation
Specialized SQL servers (e.g., Redbricks)
- Specialized support for SQL queries over star/snowflake schemas
Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
- Flexibility, e.g., low level: relational, high-level: array
- HOLAP, combine ROLAP and MOLAP
-
Relational OLAP (ROLAP)
- Use relational or extended-relational DBMS to store and manage warehouse data
-
Fact Table
- Facts provide the measurements of how well or how poorly the business is performing
-
Example: Sales, Expenses, Inventory Level
Dimension
- Contain attribute that describe data
- Address how data will be analyzed
- Business parameters that define a transaction
- Example: Analyst may want to view sales data (measure) by geography, by time, and
by product (dimensions)
DATABASE
Definition
-
Database System
The DBMS software together with the data itself.
Sometimes, the applications are also included
-
Database Concept
-
-
Data Model
Integrated collection of concepts for describing data, relationships between data and constraints on the data in an organization
-
-
-
-
Database Background
Entity
Example
Person, place, object, event, or idea
Definition
A person, organization, object type, or concept about which information is stored.
Relationship
Types
- One-to-one relationship (1:1)
- One-to-many relationship (1:M)
- Many-to-many relationship (M:N)
-
-
SCHEMA
"Classic" Star Schema
Definition
- A single fact table, with detailed and summary data
- Fact table primary key has only one key column per dimension
-
- Each dimension is a single table. highly denormalized
Benefits: Easy to understand, easy to defined hierarchies, reduces # of physical joins, low maintenance, very simple metadata
Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge dimension tables a problem
Definition
- Facts, dimensions, and attributes can be organized in several ways
- The choice of schema depends on variables such as the type of reporting that the model needs to facilitate, and the type of Business Intelligence tool being used
The "Level" Problem
- Level is a problem because it causes the potential for error. If the query builder, human or program, forgets about it, perfectly reasonable looking WRONG answer can occur
-
Snowflakes schema
-
Definition
- A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape like snowflake
Drawbacks: Normalized tables optimize storage space, but decrease performance
Starflake schema: Combination of the star and snowflakes schemas, some dimensions normalize, other not
EXTRACTION, TRANSFORMATION, AND
LOAD (ETL)
Definition
Process where data is taken from the source system, configured and stored in a data warehouse or database
Process
3. Data Loading
Definition
Takes the prepared data, applies it to the data warehouse, and stores it in the database
-
Tasks
- Append, Destructive, Merge, Constructive Merge, Insert Operation
- Places the cleaned data into the DBMS in its final, useable form
- Document the load information for the users
- Loading the data into the data warehouse presentation area
1. Data Extraction
Definition
Data from different source systems is converted into one consolidated data warehouse format which is ready for transformation processing
Tasks
- Extracting data from homogeneous or heterogeneous sources
- Reading and understanding the source data
- Copying the data needed for the data warehouse into the staging area for further manipulation
- Transfers them to the Data Warehouse server
Issues
-
-
Extraction frequency—establish how frequently the data extraction must be done—daily, weekly, quarterly, and so on.
-
-
Purpose
Reshape the relevant data from the source systems into
useful information to be stored in the data warehouse
Advantages
- Helps companies to analyze
business data for taking critical business decisions
- Able to answer complex business questions
- Provides a common data repository
- Provides a method of moving the data from various sources into a data warehouse
- Data Warehouse will update automatically once data source changes
- Essential to the success of a Data Warehouse project
- Perform complex transformations and requires the extra area to store the data
-
DATA PIPELINE
Governance & Monitoring
- Collection - Data ingestion
- Processing - Validation, cleaning, and transformation
- Data sources - Batch and streaming data
-
- Consumption - Advanced analytics, machine learning
Definition
A set of actions that ingest raw data from disparate sources and move the data to a destination for storage and analysis.
-
Purpose
- Moving data to the cloud or to a data warehouse
- Wrangling the data into a single location
- Integrating data from various connected devices and systems in IoT
- Copying databases into a cloud data warehouse
- Bringing data to one place in BI for informed business decisions
-
ETL VS ELT
ELT
- Process - Data remains in the DB of the Datawarehouse
- Code Usage - Used for high amounts of data
- Transformation - Transformations are performed in the target system
- Time-Loaded - Data loaded into target system only once. Faster
- Time-Transformation - In ELT process, speed is never dependent on the size of the data
- Implementation Complexity - To implement ELT process organization should have deep knowledge of tools and expert skills
- Support for Data warehouse - Used in scalable cloud infrastructure which supports structured, unstructured data sources. Allows use of Data lake with unstructured data
- Complexity - This process involves development from the output-backward and loading only relevant data
- Cost - Low entry costs using online Software as a service platform
- Lookups - All data will be available because extract and load occur in one single action
- Aggregations - Power of the target platform can process significant amount of data quickly
- Aggregations - Power of the target platform can process significant amount of data quickly
- Calculations - Easily add the calculated column to the existing table
- Maturity - New concept and complex to implement
- Hardware - Being Saas hardware cost is not an issue
- Support for unstructured data - Support for unstructured data readily available
ETL
- Process - Data is transformed at the staging server and then transferred to Datawarehouse DB
- Code Usage - Used for Compute-intensive Transformations & small amounts of data
- Transformation - Transformation is done in the ETL server/staging area
- Time-Load - Data is first loaded into staging and later loaded into the target system. Time intensive
- Time-Transformation - The ETL process needs to wait for the transformation to complete. As data size grows, transformation time increase
- Time-Maintenance - It needs high maintenance as you need to select data to load and transform
- Implementation Complexity - At an early stage, easier to implement
- Support for Data Warehouse - ETL model used for on-premises, relational and structured data
- Data Lake Support - Does not support
- Cost - High cost for small and medium businesses
- Lookups - In the ETL process, both facts and dimensions need to be available in the staging area
- Aggregations - Complexity increases with the additional amount of data in the dataset
- Calculation - Overwrites existing column or need to append the dataset and push to the target platform
- Maturity - The process has been used for over 2 decades. It is well documented and best practices easily available
- Hardware - Most tools have unique hardware requirements that are expensive
- Support for Unstructured Data - Mostly supports relational data
DATA, ETL, DATA WAREHOUSE