Eighth reading, Ariana Alvarado Molina - 2021089068
Eighth reading
Data Warehousing and Mining
Approach to gain insights -> detect various patterns in large volumes of data.
Decision-Support Systems
Classification of Database Applications:
Transaction Processing Systems (TPS) and Decision Support Systems (DSS).
Transaction Processing Systems (TPS):
Record detailed transaction information.
Decision Support Systems (DSS):
Objective: Obtain information from detailed TPS data.
They facilitate strategic and managerial decisions.
Examples of DSS Decisions:
Product selection in a store.
Product production planning.
Enterprise Databases:
Huge amounts of data about customers and transactions.
Transaction Information:
Includes customer name or identifier, items purchased, price, and dates.
Product Information:
Crucial for inventory and sales management.
Decision-making raises several issues:
SQL Limitations
Statistical Analysis and Packages
Data Warehouses for Diverse Data
Knowledge Discovery and Data Mining
Data Warehousing
Limitations of SQL:
Some queries are difficult to express in SQL.
Statistical analysis and packages:
Specialized packages (SAS, S++) interact with databases.
Knowledge Discovery and Data Mining:
Knowledge-discovery techniques find rules and patterns in data.
Components of a Data Warehouse
When and how to gather data.
Data Collection.
Challenge: keeping the warehouse fully up to date with its sources.
Impact on Decision Support Systems.
What schema to use.
Integrates diverse data sources with different schemas.
Converts and stores data in a unified schema.
Data transformation and cleansing
Uses fuzzy lookup for approximate matching.
Enhances data accuracy by deduplication.
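The fuzzy-lookup and deduplication idea above can be sketched in a few lines. This is only an illustration, assuming that string similarity from Python's standard difflib module is an acceptable stand-in for a real fuzzy-lookup tool; the function names, the sample customer names, and the 0.85 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def deduplicate(names, threshold=0.85):
    """Keep the first spelling of each name and drop later near-duplicates."""
    kept = []
    for name in names:
        # A name is new only if it is not close to anything already kept.
        if all(similarity(name, k) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical customer names collected from different sources.
customers = ["John Smith", "john smith ", "Jon Smith", "Mary Jones"]
unique_customers = deduplicate(customers)
print(unique_customers)
```

A lower threshold merges more aggressively and risks joining distinct customers, so the value would need tuning on real data.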
How to propagate updates
If the relations at the warehouse are identical to those at the source, the propagation is direct.
What data to summarize
Summarize large data for efficiency.
Use aggregated summaries to answer queries.
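As a rough illustration of answering queries from aggregated summaries, the sketch below builds a small summary once and then answers a per-item query from it instead of rescanning the detailed rows. The sales rows, field layout, and function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical detailed sales rows: (item, month, amount).
sales = [
    ("shirt", "Jan", 20.0),
    ("shirt", "Jan", 15.0),
    ("shirt", "Feb", 30.0),
    ("pants", "Jan", 40.0),
]


def summarize(rows):
    """Aggregate the detailed rows into total sales per (item, month)."""
    totals = defaultdict(float)
    for item, month, amount in rows:
        totals[(item, month)] += amount
    return dict(totals)


def total_for_item(summary, item):
    """Answer 'total sales of an item' from the summary, not the raw rows."""
    return sum(v for (i, _month), v in summary.items() if i == item)


summary = summarize(sales)
print(summary)
print(total_for_item(summary, "shirt"))
```

The summary has one entry per (item, month) pair, so queries at that level or coarser never need the individual transactions.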
Warehouse Schemas
Designed for analysis using OLAP tools.
Multidimensional data organized in fact tables.
Column-Oriented Storage
Traditional: Row-oriented storage.
Modern: Column-oriented storage.
Accessing values involves reading from specific files at calculated offsets.
Column-oriented storage has two advantages:
Efficient Attribute Access
Improved Compression Effectiveness
Drawback: Single tuple operations involve multiple I/O.
Use: Less in transactions, more in data warehousing.
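The trade-off described above can be mimicked in memory. The sketch below is a toy model, not how a real column store lays out files: it pivots hypothetical rows into one list per attribute, shows that an aggregate touches a single column, that rebuilding one tuple touches every column, and that a column of repeated values compresses well with run-length encoding.

```python
from itertools import groupby

# Hypothetical row-oriented data: one dict per tuple.
rows = [
    {"id": 1, "region": "north", "amount": 10.0},
    {"id": 2, "region": "north", "amount": 20.0},
    {"id": 3, "region": "south", "amount": 30.0},
]

# Column-oriented layout: one list per attribute, aligned by position.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Efficient attribute access: the aggregate reads only the "amount" column.
total = sum(columns["amount"])


def fetch_tuple(cols, i):
    """Rebuild tuple i; note that every column has to be read."""
    return {key: values[i] for key, values in cols.items()}


def run_length_encode(values):
    """Compress consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


encoded_region = run_length_encode(columns["region"])
print(total, fetch_tuple(columns, 1), encoded_region)
```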
Data Mining
Semi-automated analysis of large databases for patterns.
Focus on "knowledge discovery in databases."
Knowledge in the form of rules predicts outcomes for variables.
Other Forms of Data Mining
Text mining
Applies data-mining to textual documents.
Tools cluster pages a user has visited.
Data visualization
Maps, charts, and graphics present data compactly.
Example: encode problem locations in a distinctive color, such as red, on a map.
Classification
Predict new item classes based on past instances.
Attributes such as education and income guide classification decisions.
Aim to categorize customers as excellent, good, average, or bad.
Decision-Tree Classifiers
Uses a tree structure for classification.
Traversal from root to leaf based on data.
Internal nodes evaluate data instances.
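The root-to-leaf traversal can be sketched with a nested dictionary. The tree below is hand-written for illustration only: the attribute names (degree, income_high) and the class assigned at each leaf are assumptions, not a tree learned from data.

```python
# Hypothetical decision tree: internal nodes are dicts that test one
# attribute; leaves are plain class labels.
tree = {
    "attribute": "degree",
    "branches": {
        "masters": {
            "attribute": "income_high",
            "branches": {True: "excellent", False: "good"},
        },
        "bachelors": {
            "attribute": "income_high",
            "branches": {True: "good", False: "average"},
        },
        "none": "bad",
    },
}


def classify(node, instance):
    """Walk from the root to a leaf, following the branch that matches
    the instance's value for the attribute tested at each internal node."""
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


print(classify(tree, {"degree": "masters", "income_high": True}))
print(classify(tree, {"degree": "none", "income_high": False}))
```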
Building Decision-Tree Classifiers
Merge ranges with the same class for efficiency.
Start with one node, root, containing all instances.
Create child nodes based on attribute values.
Choose an attribute for partitioning (e.g., "degree").
Best Splits
Move from mixed-class sets to pure leaves.
Choose attributes/conditions maximizing purity.
Assess purity for effective attribute and condition selection.
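Purity is usually quantified through an impurity measure, where lower means purer. The sketch below assumes two common choices, Gini and entropy, and scores a split by how much it reduces impurity (information gain); the class labels and the two candidate splits are invented for the example.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 0 for a pure set, larger for more mixed sets."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: 0 for a pure set, 1 for an even two-class mix."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children, impurity=entropy):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = ["good", "good", "bad", "bad"]
split_a = [["good", "good"], ["bad", "bad"]]  # yields pure children
split_b = [["good", "bad"], ["good", "bad"]]  # children as mixed as the parent

print(gini(parent))
print(information_gain(parent, split_a))
print(information_gain(parent, split_b))
```

Under these measures, split_a would be preferred because it removes all impurity, while split_b gains nothing.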
Finding Best Splits
Focus on binary splits for simplicity.
For attributes with many values, combine into fewer children.
Multiway splits work for few values (e.g., "degree" or "gender").
Decision-Tree Construction Algorithm
Minimize costs for large datasets.
Various methods with unique features.
Use cutoffs for efficient recursion.
Vary branch depth based on data.
Other Types of Classifiers
Include neural-net and Bayesian classifiers.
Utilize artificial neural nets for training.
Support Vector Machine
Basic intuition about SVM provided here.
Highly accurate classifier across diverse applications.
Regression
Focuses on predicting a value, not a class.
Given variables X1, X2, ..., Xn.
Aim is to predict the value of variable Y.
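For intuition, the simplest case of regression is a straight line fitted by least squares. The sketch below assumes a single input variable X rather than X1, ..., Xn, and uses made-up data points that happen to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict a numeric value of Y (not a class) for a new X."""
    return slope * x + intercept


# Hypothetical observations.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(slope, intercept, 5))
```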
Validating a Classifier
Essential to check error rate before application.
Predicting disease X based on certain inputs.
Use known outcomes to measure error.
Various ways to measure classification quality exist.
Accuracy
Recall
Precision
Specificity
Depends on the specific application needs.
High recall important for screening tests.
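The four measures listed above follow from counts of true and false positives and negatives. The sketch below uses their standard definitions on invented test outcomes (1 = has disease X, 0 = does not) and assumes both classes appear among the actual and predicted labels, so no denominator is zero.

```python
def confusion_counts(actual, predicted):
    """Count true positives, false positives, true negatives, false negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fn = sum(1 for a, p in pairs if a and not p)
    return tp, fp, tn, fn


def quality_measures(actual, predicted):
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # fraction predicted correctly
        "recall": tp / (tp + fn),       # fraction of real positives that were found
        "precision": tp / (tp + fp),    # fraction of predicted positives that are real
        "specificity": tn / (tn + fp),  # fraction of real negatives that were found
    }


# Hypothetical known outcomes and classifier predictions.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
measures = quality_measures(actual, predicted)
print(measures)
```

For a screening test, the missed positive (the false negative) is the costly error, which is why recall would be watched most closely there.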
Association Rules
Shops seek associations between items customers buy.
Shops may discount or not based on buying patterns.
Rule: bread ⇒ milk
Shops analyze item associations in customer purchases.
Association Rules
Support
milk ⇒ screwdrivers
Fraction of the population satisfying both antecedent and consequent.
Low support rules lack statistical significance.
Confidence
bread ⇒ milk
Measures how often the consequent is true when the antecedent is true.
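Support and confidence can be computed directly from a list of transactions. The sketch below is a minimal illustration with invented shopping baskets; it treats each transaction as a set of items and a rule as an antecedent set and a consequent set.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    matching = sum(1 for t in transactions if items <= t)
    return matching / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)


# Hypothetical purchase baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk", "screwdrivers"},
    {"eggs"},
]

bread_milk_support = support(transactions, {"bread", "milk"})
bread_milk_confidence = confidence(transactions, {"bread"}, {"milk"})
milk_screwdrivers_support = support(transactions, {"milk", "screwdrivers"})
print(bread_milk_support, bread_milk_confidence, milk_screwdrivers_support)
```

In this toy data, milk ⇒ screwdrivers is backed by a single basket, which is the kind of low-support rule that lacks statistical significance.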
Example Scenario:
Discover rules of the form i1, i2, ..., in ⇒ i0.
Other Types of Associations
Predictable associations may lack interest.
Standard measures highlight interesting associations.
Clustering
Finding groups of similar data points.
Complex schemes group species at different levels.