Data Mining & KDD – Big-Picture Mind-Map
Data Mining vs. KDD
Definition of KDD – non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad et al., 1996)
Data-mining step within KDD (algorithms that enumerate patterns)
KDD 9-Step Process (Fayyad, Piatetsky-Shapiro & Smyth, 1996) – sketched as a pipeline after the list
1 Domain Understanding & Goal
2 Data Selection
3 Cleaning / Pre-processing
4 Transformation / Reduction
5 Task Choice (classification, etc.)
6 Algorithm & Hyper-parameter selection
7 Execute Data-mining
8 Interpretation / Evaluation
9 Consolidation & Deployment
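A minimal sketch of the nine steps as one Python pipeline, assuming a pandas DataFrame as the raw source; the column names and placeholder bodies are illustrative, not from the 1996 paper:

```python
import pandas as pd

def kdd_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
    # 1 goal (assumed here): predict `label` from demographic attributes
    data = raw[["age", "income", "label"]].copy()        # 2 selection
    data = data.dropna()                                 # 3 cleaning / pre-processing
    data["income"] = (data["income"] - data["income"].mean()) / data["income"].std()  # 4 transformation
    # 5 task choice: classification; 6 algorithm choice: e.g. a decision tree
    # 7 execution would fit the model here; 8 evaluate; 9 consolidate & deploy
    return data
```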
Big-Data 3 Vs (Volume, Variety, Velocity)
Pre-processing & Transformation
Data Cleaning
Missing values (drop tuple, drop attribute, impute, “unknown”)
Noise detection (semantic checks, clustering, outlier detection)
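A minimal pandas sketch of the four missing-value strategies above; the toy DataFrame is purely illustrative:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40], "city": ["Bonn", "Köln", None]})

dropped_rows = df.dropna()                           # drop the tuple
dropped_cols = df.dropna(axis=1)                     # drop the attribute
imputed      = df.fillna({"age": df["age"].mean()})  # impute (here: attribute mean)
flagged      = df.fillna("unknown")                  # dedicated "unknown" marker
```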
Transformation
Re-coding (normalisation, type conversion)
Generalisation (binning; clustering, e.g. specific colours → basic tones)
Aggregation (SUM, AVG, COUNT, daily sales)
Attribute construction (e.g., profit = revenue − cost)
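A short pandas sketch of all four transformation kinds, on invented toy data:

```python
import pandas as pd

sales = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue"],
    "revenue": [120.0, 80.0, 200.0],
    "cost": [70.0, 50.0, 120.0],
})

# Re-coding: min-max normalisation of revenue to [0, 1]
r = sales["revenue"]
sales["revenue_norm"] = (r - r.min()) / (r.max() - r.min())

# Generalisation: bin revenue into coarse categories
sales["revenue_band"] = pd.cut(r, bins=[0, 100, 300], labels=["low", "high"])

# Aggregation: daily sales via SUM
daily = sales.groupby("day")["revenue"].sum()

# Attribute construction: profit = revenue - cost
sales["profit"] = sales["revenue"] - sales["cost"]
```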
Data Reduction
Dimensionality reduction (remove irrelevant / redundant attrs)
Sampling (random, cluster, stratified)
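Random and stratified sampling in pandas (cluster sampling would instead draw whole groups); the data are invented:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10), "label": ["a"] * 5 + ["b"] * 5})

random_sample = df.sample(frac=0.3, random_state=0)            # simple random sample

# Stratified: draw the same fraction from every class, preserving proportions
stratified = df.groupby("label").sample(frac=0.4, random_state=0)
```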
Evaluation & Validation
Confusion Matrix (TP, FP, FN, TN)
Accuracy, Precision, Recall, Specificity, F1
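The five measures computed directly from confusion-matrix counts; the numbers are made up:

```python
# Toy confusion-matrix counts, for illustration only
tp, fp, fn, tn = 40, 10, 5, 45

accuracy    = (tp + tn) / (tp + fp + fn + tn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)          # sensitivity
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)
```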
Train–Test Split
m-fold Cross-Validation
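A hold-out split and m-fold cross-validation (m = 5) with scikit-learn, using the bundled Iris data purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold-out: a single train-test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# m-fold cross-validation: every tuple is used for testing exactly once
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
print(scores.mean())
```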
Completeness & Consistency of Concepts
Pruning of Decision Trees (pre- & post-pruning)
Early Stopping for ANNs
Supervised Learning
Decision-Tree Family
TDIDT (ID3, C4.5, C5.0)
Entropy & Information Gain
Handling continuous attrs, value grouping, pruning
Tree representation ↔ propositional formula
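A worked sketch of the entropy and information-gain measures used for ID3-style splitting; the play/no-play labels are a toy stand-in for the usual weather data:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Gain = H(S) minus the size-weighted entropy of the partition `groups`."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# A perfectly separating split yields the full 1.0 bit of gain
print(information_gain(["yes", "yes", "no", "no"],
                       [["yes", "yes"], ["no", "no"]]))  # -> 1.0
```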
Instance-Based Learning
Case / Case Base concepts
Nearest-Neighbor principle (distance vs similarity)
Voronoi interpretation
Knowledge-intensive similarity
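A minimal nearest-neighbour classifier, assuming plain Euclidean distance as the (inverse) similarity measure; the points and k are illustrative:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest neighbours (Euclidean)."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
y = np.array(["a", "a", "b", "b"])
print(knn_predict(X, y, np.array([5.5, 5.0])))  # -> "b"
```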
Bayesian Learning
Bayes’ Theorem, MAP hypothesis
Naïve Bayes
Conditional independence assumption
Discrete attrs (Laplace / m-estimate smoothing against the zero-probability problem)
Continuous attrs (Gaussian assumption)
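A sketch of Naïve Bayes for discrete attributes with Laplace smoothing; the three weather-style training rows are invented:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)     # (class, attr index) -> value counts
    values_per_attr = defaultdict(set)
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(c, i)][v] += 1
            values_per_attr[i].add(v)
    return class_counts, value_counts, values_per_attr

def predict_nb(model, row):
    class_counts, value_counts, values_per_attr = model
    n = sum(class_counts.values())
    best, best_p = None, 0.0
    for c, cc in class_counts.items():
        p = cc / n                                   # prior P(c)
        for i, v in enumerate(row):                  # likelihoods P(v | c)
            num = value_counts[(c, i)][v] + 1        # Laplace: no zero probability
            den = cc + len(values_per_attr[i])
            p *= num / den
        if p > best_p:
            best, best_p = c, p
    return best

model = train_nb([["sunny", "hot"], ["sunny", "mild"], ["rain", "mild"]],
                 ["no", "no", "yes"])
print(predict_nb(model, ["rain", "mild"]))  # -> "yes"
```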
Neural Networks
McCulloch-Pitts neuron (weights, threshold, Boolean output)
Single-layer Perceptron limitation (cannot represent XOR / non-linearly separable concepts)
Multilayer Perceptron (input, hidden, output, bias)
Non-linear activations (Sigmoid, Tanh, ReLU)
Back-propagation
Forward pass → loss → backward pass → weight update
Batch vs Mini-batch gradient descent
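A compact NumPy sketch of one hidden layer trained by batch gradient descent on XOR; constant factors from the squared-error loss are folded into the learning rate, and the architecture (4 hidden units, sigmoid everywhere) is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR: needs the hidden layer

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)        # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)        # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(5000):                                # batch: all 4 examples per update
    h = sigmoid(X @ W1 + b1)                         # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)              # backward pass (chain rule)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out                          # weight update
    b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(0)

print(out.round(2))                                  # should approach [[0], [1], [1], [0]]
```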
Deep-NN challenges
Vanishing / exploding gradients
Mitigation: Xavier / He initialisation, ReLU, BatchNorm, skip connections
Slow training, data scarcity, over-fitting (dropout, early stop)
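The two initialisation schemes from the mitigation list, expressed as scale factors on Gaussian weights; the fan sizes are arbitrary:

```python
import numpy as np

fan_in, fan_out = 256, 128
xavier = np.random.randn(fan_in, fan_out) * np.sqrt(2 / (fan_in + fan_out))  # tanh/sigmoid layers
he     = np.random.randn(fan_in, fan_out) * np.sqrt(2 / fan_in)              # ReLU layers
```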
Unsupervised Learning – Clustering
Partitioning Algorithms
k-Means (centroid, random start, sensitivity to k & noise)
k-Medoid / PAM (medoid swap, robust but slower)
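A bare-bones k-Means (Lloyd's algorithm) showing the random start and centroid updates; there is no empty-cluster or convergence handling, and the 2-D points are invented:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Alternate nearest-centroid assignment and centroid recomputation."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # random initial centroids
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        assign = dists.argmin(axis=1)                     # assignment step
        centroids = np.array([X[assign == j].mean(axis=0) for j in range(k)])
    return assign, centroids

X = np.array([[1.0, 1], [1.2, 0.8], [5, 5], [5.2, 4.8]])
print(kmeans(X, k=2)[0])                                  # two clear clusters
```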
Hierarchical Clustering
Agglomerative (AGNES) – linkage: single, complete, average, centroid
Divisive (DIANA)
Dendrogram & cut-level
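Agglomerative clustering with SciPy: build the merge tree with a chosen linkage, then cut the dendrogram at a height; the points and the cut level t=3 are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.array([[0.0, 0], [0, 1], [5, 5], [5, 6], [9, 9]])

Z = linkage(X, method="single")   # AGNES-style merging; try "complete", "average", "centroid"
labels = fcluster(Z, t=3, criterion="distance")   # cut the dendrogram at height 3
print(labels)
# dendrogram(Z) draws the merge tree (requires matplotlib)
```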
Density-Based Clustering
DBSCAN core ideas (ε, MinPts)
Core points; directly density-reachable, density-reachable, density-connected
k-distance elbow to pick ε
Pros: arbitrary shapes, no preset k, noise detection
Cons: parameter choice, varying density, high-dim curse
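DBSCAN via scikit-learn, plus the k-distance values one would plot to pick ε; eps=1.0 and MinPts=2 are choices for this toy data only:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0, 0], [0, 0.5], [0.5, 0], [5, 5], [5, 5.5], [9, 9]])

# k-distance values: sort each point's distance to its MinPts-th neighbour,
# then look for the "elbow" to choose eps
k = 2
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
print(np.sort(dists[:, k]))

labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
print(labels)   # -1 marks noise; [9, 9] has no neighbour within eps
```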
Other Families
Density, Conceptual, Grid-based, SOM / Kohonen, Graph-theoretic
Data Warehousing & Architecture
DWH characteristics (subject-oriented, integrated, non-volatile, time-variant – Inmon)
Mucksch & Behme components mapped to KDD steps
Operational sources → Goal & Selection
ETL → Clean & Transform
Warehouse core → Transformation
Data Marts / OLAP → Aggregation & Exploration
DM tools → Data mining step
Presentation layer → Evaluation & Deployment
Meta-data layer → supports all steps
When DM is too complex
Simpler analytics (OLAP, dashboards)
Expert / rule-based systems
Consultants / cloud analytics
Prototype & incremental skill build
Data quality first
Training & change management
Scenario analysis, simulations
Outsource advanced modelling