[CHAPTER 2] Module 3:
Data Exploration and Data Preprocessing
Data Mining Framework:
CRISP-DM
1. Business Understanding
focuses on understanding the objectives and requirements of the project
Steps Taken:
Determine Business Objectives
Assess Situation
Determine Data Mining Goals
Produce Project Plan
What is CRISP-DM?
an open standard process model that describes common approaches used by data mining experts
CRISP-DM stands for
CRoss Industry Standard Process for Data Mining
2. Data Understanding
to identify, collect, and analyze the data sets that can help you accomplish the project goals
Steps Taken:
Collect Initial Data
Describe Data
Explore Data
Verify Data Quality
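The "Describe Data" and "Verify Data Quality" steps above can be sketched in a few lines of plain Python. This is only a minimal illustration: the records, the field names, and the helpers `describe` and `missing_counts` are hypothetical and not part of CRISP-DM itself.

```python
from collections import Counter
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "occupation": "engineer"},
    {"age": 31, "occupation": None},
    {"age": 47, "occupation": "teacher"},
    {"age": 31, "occupation": "engineer"},
]

def describe(rows, numeric_field):
    """Describe Data: basic summary statistics for one numeric attribute."""
    values = [r[numeric_field] for r in rows if r[numeric_field] is not None]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }

def missing_counts(rows):
    """Verify Data Quality: count missing values per attribute."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                counts[field] += 1
    return dict(counts)

summary = describe(records, "age")
missing = missing_counts(records)
print(summary)
print(missing)
```

In practice a library such as pandas would usually do this work; the stdlib version is used here only to keep the sketch self-contained.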
3. Data Preparation
an important phase in which data is cleaned, extracted, and identified for use in modeling.
Goal? To ensure the data is of good quality and useful
Process?
May include further data visualization, data aggregation, and training a statistical model
4. Modeling
build and assess various models based on several different modeling techniques
Steps Taken:
Select Modeling Techniques: Determine which algorithms to try
Generate Test Design: might need to split the data into training, test, and validation sets
Build Model
Assess Model
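The "Generate Test Design" step mentions splitting the data into training, validation, and test sets. A minimal sketch of such a split follows; the function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are assumptions chosen for illustration, not values prescribed by CRISP-DM.

```python
import random

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the rows and cut them into train, validation, and test parts.

    Whatever is left after the train and validation parts becomes the test set.
    """
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(10))
print(len(train), len(val), len(test))
```

Shuffling before cutting avoids a split that follows the original record order, which matters when the data is sorted. For time-ordered data a chronological split is usually more appropriate than a random one.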
5. Evaluation
determine which model best meets the business objectives and what to do next
Steps Taken:
Evaluate Result: Do the models meet the business success criteria?
Review Process
Determine Next Steps
6. Deployment
can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
Four tasks it normally includes:
Plan Deployment
Plan Monitoring and Maintenance
Produce Final Report
Review Project
DATA PREPROCESSING
Why?
To make the data suitable for analysis
The aim is not to improve the data itself but to ensure it fits the chosen data mining techniques and tools
How (Major Tasks)
1. Data Cleaning
: Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
2. Data Integration
: Integration and aggregation of multiple databases, data cubes, or files
3. Data Transformation
: Normalization or standardization
4. Data Reduction
: Obtains a representation that is reduced in volume but produces the same or similar analytical results
5. Data Discretization
: Part of data reduction but with particular importance, especially for numerical or continuous data
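Three of the major tasks above (cleaning, transformation, and discretization) can be sketched on a single numeric attribute. This is an illustrative sketch only: the helper names, the choice of mean imputation, and the use of equal-width bins are assumptions, and other techniques are equally valid for each task.

```python
from statistics import mean, pstdev

def fill_missing(values):
    """Data cleaning: replace missing values (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def min_max(values):
    """Data transformation (normalization): rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Data transformation (standardization): zero mean, unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def equal_width_bins(values, n_bins):
    """Data discretization: map each value to the index of an equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [20, None, 40, 60]            # hypothetical attribute with one missing value
cleaned = fill_missing(ages)
scaled = min_max(cleaned)
standardized = z_score(cleaned)
bins = equal_width_bins(cleaned, 2)
print(cleaned, scaled, standardized, bins)
```

Mean imputation is the simplest way to fill a gap but it shrinks the variance of the attribute, so the median or a model-based estimate may be preferable on skewed data.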
Data Exploration
TERMINOLOGY OF DATA
Dataset
: a collection of data objects
Characteristic of Datasets:
Time series
: series of data points ordered in time, eg: Stock Exchange
Graph-based data
: data represented as a graph, with nodes, edges, and properties used to represent and store it, eg: a graph database that supports semantic queries
Sequential data
: any kind of data where the order matters, eg: time-series data
Spatial data
: the relative geographic information about the earth and its features
Dimensionality
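: the number of attributes (dimensions) that the objects in a dataset have; in a data warehouse, a dimension is a structure that categorizes facts and measures so that users can answer business questions, eg: people, products, place and time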
Sparsity
: many gaps (missing or zero values) present in the recorded data, eg: sensor signals
Resolution
: the level of detail at which data is recorded, eg: the number of pixels in an image display
Record data
: basic data structure, eg: rows in database
Transaction data
: has a time dimension, a numerical value and refers to one or more objects
The Data Matrix
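: a collection of data objects that share the same fixed set of numeric attributes, arranged as a matrix with one row per object and one column per attribute (not to be confused with the Data Matrix barcode, a two-dimensional code of black and white "cells" or dots arranged in a square or rectangular pattern, similar to a QR code)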
Data
: the values of attributes
Attribute
: a property or characteristic of an object
Population:
whole set of items under consideration
Sample:
part of the population that has been selected
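The difference between a population and a sample can be shown with a simple random sample drawn without replacement. The population of 100 integers, the sample size of 10, and the seed are all hypothetical values chosen for this sketch.

```python
import random
from statistics import mean

# Hypothetical population: the whole set of items under consideration.
population = list(range(1, 101))

# Sample: a part of the population, selected at random without replacement.
rng = random.Random(42)            # fixed seed so the draw is repeatable
sample = rng.sample(population, 10)

population_mean = mean(population)
sample_mean = mean(sample)         # an estimate of the population mean
print(population_mean, sample_mean)
```

The sample mean will usually differ from the population mean; larger samples tend to bring the two closer together.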
UNDERSTANDING DATA // TYPE OF DATA
DATA QUALITY
Aspect:
Incomplete
: lacking attribute values, eg: occupation=“ ”
Inconsistent
: containing discrepancies in codes or names, eg: was rating “1, 2, 3”, now rating “A, B, C”
Noisy
: containing errors or outliers, eg: Salary=“-10”
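The three aspects above can be checked per record with simple rules. The records, the field names, the allowed rating codes, and the rule that a salary must not be negative are assumptions made for this sketch, mirroring the examples in the notes.

```python
# Hypothetical records mirroring the examples above.
records = [
    {"name": "A", "occupation": "clerk", "rating": "1", "salary": 3000},
    {"name": "B", "occupation": " ", "rating": "2", "salary": 4200},
    {"name": "C", "occupation": "nurse", "rating": "B", "salary": -10},
]

# Assumed current coding scheme for ratings; anything else is a discrepancy.
ALLOWED_RATINGS = {"1", "2", "3"}

def quality_issues(row):
    """Return the data quality aspects that a single record violates."""
    issues = []
    if not str(row["occupation"]).strip():
        issues.append("incomplete")      # attribute value is blank
    if row["rating"] not in ALLOWED_RATINGS:
        issues.append("inconsistent")    # code does not match the scheme
    if row["salary"] < 0:
        issues.append("noisy")           # value is an obvious error
    return issues

report = {row["name"]: quality_issues(row) for row in records}
print(report)
```

Rules like these only flag problems; deciding whether to fill, correct, or drop the affected records belongs to the data cleaning task.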
Multi-Dimensional Measure
Accuracy
Completeness
Consistency
Timeliness
Believability
: the extent to which users believe data in a dataset are accurate
Value added
Interpretability
Accessibility
object = values