Advanced Data Warehousing (tips (cardinality (you have 3 years of…
Advanced Data Warehousing
Advanced schema design
Completely Normalized Snowflake Schema
Only the immediate parent ID stored in child tables
more joins in SQL
Moderately Denormalized Snowflake Schema
the only reason why to introduce redundancy is to improve query performance.
Slightly larger tables
All higher-level IDs stored in child tables
grandparent, grand-grandparent, and so on .
Fewer joins in SQL.
Completely Denormalized Snowflake Schema
Much larger tables
All higher-level IDs/Desc stored in child tables.
Fewest joins in SQL
Store all information in only one lookup table, no separated tables for parent level and grandparent level.
Very few tables.
Very larger tables
All IDs/Desc for all attributes in a hierarchy stored in a single table.
Single join in SQL.
If you want to use aggregate fact table, you have to use snowflake schema, 就是说，需要有aggregated lookup table for higher level attributes.因为用base fact table和aggregate fact table做join会使得得到的数值是实际数值的若干倍，倍数是相应的record数目。这是因为做join的fact table和lookup table的level不匹配。
data warehouse optimization
very good for large cardinality attribute
因为是基于primary key的，所以easier to maintain（因为不需要重新计算key）
requires the database to generate a binary string for each row
so this is better for attribute that is with a lower cardianlity
The table will be organized, that being said, to change the order of rows, to make the table data stored with rows index organized.
嵌套排序，比如按照定义有Store_ID, Date_ID, Cust_ID.则按照从外到内Store_ID, Date_ID, Cust_ID的顺序将table中的row进行排序。
The index needs no additional space to store. You build the index into the table, when generate the table.
The index is build , or we put the data into this order, when moving the data from the data source to the data warehouse.
excellent solution for large tables.
Where to build index on?
denormalized table will store foreign keys. Those are often used to join are good to build index on.
frequently filtered elements
比如总做state的filter， 就可以在fact table的state_ID这个column上做index
Columns that are often used to join with other tables are excellent candidates to build index.
It's recommended to store the index on another machine. data on one machine, and index on another machine. To balance the load of the server.
Satisfy Report Requirements
Increase Query Performance
Denormalizing: introduce redundancy
transaction database: standard is to make your database normalized, so it's easy to write into the database, and the database is easier to maintain.
But the data warehouse is a different story. Once the information is put into the data warehouse, it's not going to change a lot. The most important issue is to run query against it. This is why we see denormalization in data warehouse.
store parent id in the child table in order to build the relationship.
you have 3 years of information, and you have a lookup table for that, the cardinality of the lookup table is 3.
number of rows in a table, 用来说明表的大小/尺寸
Advanced Data Modeling
Direct and Indirect Attribute Relationships
defined by lookup table
three types of direct attribute relationships
you can keep information for both attributes in the same table.
You need to have a separate relationship table.
scenario 1, analytical capability
how many items in a certain color were sold?
keep analytical capability in your data model
the fact table doesn't cover all the possible combinations, for this you need the relationship table.
scenario 2, total sales for red items
three ways to resolve this
a separate relationship table
a compound child attribute
use "Item_ID" and "Color_ID" together as the key for Item
a common child attribute
relationship through a fact
defined in a fact table
e.g, Item-Date <Revenue table> Item table joins this fact table, and date table joins this fact table.
in MSTR you only need to define direct relationships, indirect relationships are detected in fact tables for "join"
e.g. "Customer City" in "Customer" hierarchy, and "Store City" in "Store" hierarchy
Solution 1. Create a "Table View" for each attribute role;each attribute references its corresponding table view. This table view is created in the data warehouse.
Cons: create and have to maintain multiple views;
Solution 2. Use "Automatic Attributes Role Recognition" in VLDB Properties in Database Instance level
Query optimizations -> Engine Attribute Role Options
Limitations: the two attribute roles cannot be in a same hierarchy or have a common child; the alias is created in the memory, so there is a 100 role limit an attribute.
Solution 3. Create Explicit Table Alias
Solution 4. Create logical views.
a gap in the hierarchies in some cases.
Solution 1. Revising the Data Model
The Region_ID column is added to the LU_ACCT_EXEC table to directly relate regions to account executives.
Solution 2. Null Values
resolve by creating joint child relationships.
Solution 1: Flattening the table
Solution 2: Relationship Table
As Is v.s. As Is
Only current data
this is the usual option, because there usually is not historical records available
As Is v.s. As Was
historical and current data
this is usually used for comparison
As Was v.s. As Was
only historical data
this is typical not a query requirement, usually people don't need to see this.
Like v.s. Like
intersection between historical and current data
this is usually used for analyzing constant things
<start_date, end_date> combined with employee, together as a record in "LU_EMPLOYEE" table
Cons: tedious and time consuming