DATA PREPROCESSING
Measures for data quality:
Consistency
Timeliness
Completeness
Believability
Accuracy
Interpretability
Major Tasks in Data Preprocessing
Data transformation
Strategies:
Attribute construction
Aggregation
Normalization
Data Normalization Methods
z-score normalization
normalization by decimal scaling
min-max normalization
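The three normalization methods listed above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names and the default target range [0, 1] are assumptions made for the example, not part of the original outline.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so map them to the lower bound.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer with max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

Min-max keeps the shape of the distribution but is sensitive to outliers, z-score is the usual choice when the minimum and maximum are unknown, and decimal scaling only moves the decimal point.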
Discretization
Techniques
Histogram analysis
Cluster analysis
Binning
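As one concrete case of discretization by binning, the sketch below assigns each numeric value to one of k equal-width intervals. Equal-width partitioning is only one possible choice (equal-frequency is another), and the function name is hypothetical.

```python
def equal_width_bins(values, k):
    """Return, for each value, the index (0..k-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # Every value is the same, so they all fall in the first interval.
        return [0 for _ in values]
    # The maximum value would land in interval k, so clamp it to k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]
```

For example, `equal_width_bins([0, 5, 10, 15, 20], 4)` uses intervals of width 5 and returns `[0, 1, 2, 3, 3]`.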
Data integration
Data Integration Issues:
Entity identification problem
Metadata can be used to avoid errors in schema integration.
Tuple duplication
Data value conflict detection and resolution
Redundancy and correlation analysis
Causes of redundant data
Inconsistent attribute or dimension naming
Derivable data
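For numeric attributes, correlation analysis can flag a derivable (and therefore redundant) attribute. The sketch below computes the Pearson correlation coefficient by hand; the attribute names in the usage note are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return covariance / sqrt(var_a * var_b)
```

If a hypothetical `annual_revenue` attribute is always 12 times `monthly_revenue`, the coefficient is 1 (up to rounding), which suggests one of the two can be dropped after integration. A value near 0 indicates no linear relationship, not necessarily independence.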
Data cleaning
Inaccurate (noisy)
Reasons for noisy data
Faulty data collection instruments
Human errors at data entry
Data transmission problems
Technology limitation
Inconsistency in naming convention
How to Handle Noisy Data?
Data Smoothing techniques
Regression
Binning
Smooth by bin means
Smooth by bin medians
Smooth by bin boundaries
Clustering
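The three binning-based smoothing variants listed above can be illustrated as follows. This sketch assumes equal-frequency bins of a fixed size over the sorted values; the helper names are made up for the example.

```python
from statistics import mean, median


def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max (ties go to min)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are already sorted
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
# bins                       -> [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
# smooth_by_means(bins)      -> [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
# smooth_by_medians(bins)    -> [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
# smooth_by_boundaries(bins) -> [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Larger bins smooth more aggressively, at the cost of flattening genuine local variation.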
Inconsistent
Incomplete (missing)
Reasons for missing data
Attributes may not be applicable to all cases
Inconsistent with other recorded data and thus deleted
Information is not collected
Human/Hardware/Software problems
How to Handle Missing Data?
Fill in the missing value manually
Fill in the missing value automatically
Ignore the tuple
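Two of the options above, ignoring the tuple and filling in the value automatically, can be sketched like this. The example assumes tuples are dictionaries with `None` marking a missing value, and it uses the attribute mean as the automatic fill; other fills (a global constant, the median, the most probable value) are equally valid choices.

```python
from statistics import mean


def drop_incomplete(rows):
    """Ignore the tuple: keep only rows that have no missing value."""
    return [row for row in rows if None not in row.values()]


def fill_with_mean(rows, attribute):
    """Fill automatically: replace missing values of one attribute with its mean."""
    observed = [row[attribute] for row in rows if row[attribute] is not None]
    if not observed:
        raise ValueError(f"no observed values for {attribute!r}")
    fill = mean(observed)
    # Build new dictionaries so the input rows are left untouched.
    return [
        {**row, attribute: fill if row[attribute] is None else row[attribute]}
        for row in rows
    ]
```

Dropping tuples is simple but wasteful when many rows have a single missing field, while mean filling keeps the rows at the cost of shrinking the attribute's variance.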
Data reduction
Dimensionality Reduction
such as Attribute subset selection
Basic heuristic methods of attribute subset selection:
Stepwise forward selection
Stepwise backward elimination
Combination of forward selection and backward elimination
Decision tree induction
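Stepwise forward selection can be sketched as a greedy loop: start from the empty set and repeatedly add the attribute that improves a score the most. The `score` callable and the `min_gain` stopping threshold are assumptions for illustration; in practice the score might be cross-validated accuracy or a statistical significance test.

```python
def forward_selection(attributes, score, min_gain=0.0):
    """Greedily add the attribute with the best score until no gain exceeds min_gain.

    `score` is a caller-supplied function mapping a list of attribute names
    to a number, where higher is better.
    """
    selected = []
    remaining = list(attributes)
    best = score(selected)
    while remaining:
        # Evaluate each candidate when added to the current subset.
        candidates = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(candidates, key=lambda pair: pair[0])
        if top_score - best <= min_gain:
            break
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected


# Toy score: made-up usefulness per attribute, minus a small cost per attribute.
usefulness = {"age": 0.5, "income": 0.3, "customer_id": 0.0}


def toy_score(subset):
    return sum(usefulness[a] for a in subset) - 0.05 * len(subset)
```

With the toy score, `forward_selection(list(usefulness), toy_score)` picks `age`, then `income`, and stops because adding `customer_id` lowers the score. Backward elimination is the mirror image: start from the full set and repeatedly remove the attribute whose removal hurts least.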
Numerosity Reduction
Parametric methods
such as Regression
Non-parametric methods
such as Histograms, Clustering, Sampling
Types of Sampling
Simple random sample with replacement
Cluster sample
Simple random sample without replacement
Stratified sample
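The four sampling types above can be sketched with the standard `random` module. The function names, the proportional allocation in the stratified case, and the minimum of one tuple per stratum are illustrative assumptions.

```python
import random
from collections import defaultdict


def srswor(data, n, rng):
    """Simple random sample without replacement: each tuple is drawn at most once."""
    return rng.sample(data, n)


def srswr(data, n, rng):
    """Simple random sample with replacement: a tuple may be drawn more than once."""
    return rng.choices(data, k=n)


def cluster_sample(clusters, m, rng):
    """Pick m whole clusters at random and keep every tuple inside them."""
    chosen = rng.sample(clusters, m)
    return [item for cluster in chosen for item in cluster]


def stratified_sample(data, key, fraction, rng):
    """Sample the same fraction from each stratum (at least one tuple per stratum)."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample
```

Passing a seeded generator such as `random.Random(42)` makes the draws reproducible. Stratified sampling is the usual choice for skewed data, because it guarantees that small groups are still represented.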
Data Compression
Lossy
Lossless
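The difference between the two kinds of compression can be shown with a short sketch. Lossless compression reconstructs the original exactly, illustrated here with the standard `zlib` module; lossy compression only approximates it, illustrated here by simple quantization. Both choices are stand-ins picked for the example, not methods named in the outline.

```python
import zlib


def lossless_round_trip(payload: bytes) -> bytes:
    """Compress and decompress; the result equals the input byte for byte."""
    compressed = zlib.compress(payload)
    return zlib.decompress(compressed)


def lossy_quantize(values, step):
    """Snap each value to the nearest multiple of step; the original is not recoverable."""
    return [round(v / step) * step for v in values]
```

For example, `lossy_quantize([1.2, 3.7], 1)` returns `[1, 4]`: the data is smaller to store, but the discarded fractional parts cannot be restored.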