Data quality & Data preprocessing
2.1 Definition of quality &
criteria for its evaluation
2.1.1. Quality. ISO 9000:2015.
Degree to which a set of inherent characteristics fulfils requirements
Requirements: the requirements of the user
Set of inherent characteristics:
criteria concerning different characteristics
2.1.2. Criteria for evaluating
data quality
Relevance
Do the data address current/potential needs;
inform issues of importance to the data users
Accuracy
Represent phenomena they
were designed to measure
Reliability
Quality over time,
repeatability of the data
Timeliness
(currency)
Difference between the time when data are
collected and when they become available
Punctuality
Release date vs target date
Coherence
("consistency")
The extent to which data can be used
with other data and over time. Irrelevant details; confusing measures; ambiguous formats. Beyond numerical consistency
Compatibility
Whether compatible in format and
definition with other data
Completeness
No missing values
Accessibility
Ease of accessing the data
Interpretability
Availability of information for the interpretation
of the data and their uses
Security
Physically and logically secure?
Trust
Believability, reliability, reputation,
if from authoritative source
Usefulness
Advantages of using the data
Considerations:
Data quality:
Doesn't need to be perfect
Quality trade-offs
2.2. Data Preprocessing
2.2.1 Data sources & tasks
in data preprocessing
Analytics takes a simplistic view of data
as a table with well-defined columns
In real life, data rarely look like a single table;
they come from different sources in different formats and are always dirty
Data preprocessing: bring together
all sources, extract useful features
=> biggest challenge in analytics
Data Sources
Relational data warehouse
("star schema format")
Multi-dimensional
data stores (cubes)
NoSQL databases
Why data preprocessing?
Real-world data
are always dirty
Need to be integrated from diff sources
Have missing values
Contain errors &
inconsistent values
Contain outliers
Do not have the right level
of aggregation
Important step for
successful analytics
Quality decisions & analytics
need quality data
Tasks in data preprocessing
Data integration
Integr. of multiple databases, cubes or files
Data cleaning
Impute missing values, remove noise data,
remove outliers, resolve inconsistencies
Data transformation
Aggregation, normalisation,
"feature engineering"
Data reduction
Reduce the volume while
producing the same (or similar) analytical results
Data discretisation
Reduction for quantitative data
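A minimal sketch of discretisation (binning a quantitative variable into a few levels) with pandas; the DataFrame and the "income" column are illustrative assumptions, not from the source:

```python
# Sketch: discretise a quantitative variable into qualitative bins.
import pandas as pd

df = pd.DataFrame({"income": [12_000, 25_000, 40_000, 70_000, 150_000]})

# Equal-width binning: fixed-width intervals mapped to labels
df["income_band"] = pd.cut(df["income"], bins=3, labels=["low", "medium", "high"])

# Equal-frequency binning: roughly the same number of records per bin
df["income_quartile"] = pd.qcut(df["income"], q=4, labels=False)

print(df)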
2.2.2. Data Integration
Inconsistencies between
data sources
Diff. definitions & classifications
Diff. in timing
Error in one source
Diff. spelling
Diff. abbreviations
Default values, different data types
Data linkage & matching
Exact matching
(deterministic record linkage).
If there is a common identifier
Can match or NOT
Depends on the quality of the variables
May not be sufficient alone
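A minimal sketch of exact matching, assuming two pandas DataFrames that share a hypothetical "person_id" identifier (names and data are made up):

```python
# Sketch: deterministic record linkage on a shared identifier.
import pandas as pd

survey = pd.DataFrame({"person_id": [1, 2, 3], "income": [30_000, 45_000, 28_000]})
register = pd.DataFrame({"person_id": [2, 3, 4], "region": ["North", "South", "East"]})

# Each pair of records either matches on the identifier or it does not;
# rows without a counterpart are dropped by the inner join.
linked = survey.merge(register, on="person_id", how="inner")
print(linked)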
Probabilistic matching
Use matching keys -
common variables:
address, occupation, etc
Rely on probabilities to
determine which records match
Classifying into: matches,
non-matches, possible matches
Distinguishing Power (DP) of variable:
uniqueness of the values
High DP (reference
number, full name,...)
Low DP (sex, age, nationality)
Can depend on the level of detail
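One simple proxy for the distinguishing power of candidate matching variables is the share of unique values in each column; a rough sketch with made-up data:

```python
# Sketch: proportion of unique values as a proxy for distinguishing power (DP).
import pandas as pd

df = pd.DataFrame({
    "full_name": ["A. Smith", "B. Jones", "C. Lee", "D. Khan"],
    "sex": ["F", "M", "F", "M"],
    "nationality": ["UK", "UK", "UK", "FR"],
})

dp = df.nunique() / len(df)   # close to 1 => high DP, close to 0 => low DP
print(dp.sort_values(ascending=False))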
Intro
Problems:
Different attribute names
Different units
Different scales
Derived attributes
Inconsistency due
to redundancy
Data integration combines data
from multiple sources into
a common format for analytics
Matching
techniques
Clerical matching: with human intervention
Automatic matching: minimal human intervention
Aim: maximise automatic matching, minimise clerical matching
Record linkage process
Record pair comparison
Similarity vector classification: "matches",
"non-matches", "possible matches"
Blocking / indexing: split records into blocks,
compare only records in the same block
Clerical review for possible matches
Cleaning and standardisation
Evaluation: the complexity, completeness and
quality of the linked records are evaluated
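A rough sketch of these steps (blocking, record pair comparison, classification by similarity, flagging possible matches for clerical review); the data, the string-similarity measure and the thresholds are all assumptions:

```python
# Sketch: block on postcode, compare names, classify pairs.
from difflib import SequenceMatcher

import pandas as pd

a = pd.DataFrame({"name": ["John Smith", "Mary Brown"], "postcode": ["AB1", "CD2"]})
b = pd.DataFrame({"name": ["Jon Smith", "Marie Brown", "Tom Ford"],
                  "postcode": ["AB1", "CD2", "AB1"]})

def similarity(x, y):
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, x, y).ratio()

pairs = []
for _, ra in a.iterrows():
    # Blocking / indexing: only compare records in the same postcode block
    for _, rb in b[b["postcode"] == ra["postcode"]].iterrows():
        s = similarity(ra["name"], rb["name"])
        if s >= 0.9:
            label = "match"
        elif s >= 0.7:
            label = "possible match"   # would go to clerical review
        else:
            label = "non-match"
        pairs.append((ra["name"], rb["name"], round(s, 2), label))

print(pd.DataFrame(pairs, columns=["record_a", "record_b", "similarity", "class"]))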
2.3. Quality measurement
High-quality data:
"fit for use" in their intended role;
correctly represent the real-world
construct to which they refer
Quality of internal data ("consistency")
becomes more important
Data preprocessing (cleaning) may be
required to ensure quality
2.2. Data Preprocessing
2.2.3 Data Cleaning
Missing values
DISGUISED MISSING VALUES:
unknown or inapplicable
values encoded as valid data
(0 instead of n/a)
To detect:
Missing value plot
Pareto chart of percent
missing for each variable
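A minimal sketch of both ideas, recoding a disguised zero and ranking variables by percent missing (the basis of a Pareto chart); the data and the "0 means unknown" rule are assumptions:

```python
# Sketch: recode a disguised missing value, then compute percent missing per variable.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 51, np.nan],
    "income": [30_000, 0, 45_000, 28_000],      # 0 was used where income is unknown
    "region": ["North", "South", None, "East"],
})

# Disguised missing value: turn the sentinel 0 into a true missing value
df["income"] = df["income"].replace(0, np.nan)

# Percent missing per variable, sorted descending
pct_missing = df.isna().mean().sort_values(ascending=False) * 100
print(pct_missing)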
Distorted values
Outliers
2.2.4 Data transformation
Techniques
Aggregation
Summarisation
Cube construction
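A small sketch of aggregation / cube-style summarisation with a pandas pivot table; the sales data and column names are made up:

```python
# Sketch: summarise detailed records into a region x year table of totals.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "year": [2023, 2024, 2023, 2024],
    "amount": [100, 120, 80, 95],
})

cube = sales.pivot_table(index="region", columns="year",
                         values="amount", aggfunc="sum", margins=True)
print(cube)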
Normalization
Standardisation:
Z-score transformation
Decimal scaling: shift the decimal
point of all values
Min-Max scaling: to 0-1 range
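A short sketch of the three scalings above on one numeric vector (the values are made up):

```python
# Sketch: z-score standardisation, decimal scaling and min-max scaling.
import numpy as np

x = np.array([200.0, 300.0, 400.0, 1000.0])

# Z-score: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10**j, with j chosen so every |value| falls below 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / 10**j

# Min-max: rescale to the 0-1 range
min_max = (x - x.min()) / (x.max() - x.min())

print(z, decimal_scaled, min_max, sep="\n")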
Attribute/feature
construction/extraction
New attribute extraction: principal
component analysis, clustering
Feature engineering:
development of "better" features
Often most effective:
comes from understanding the problem
A combination of predictors may be more effective
than the individual values
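A brief sketch of both ideas: a domain-motivated combination of predictors and new attributes extracted with PCA. The feature names, the BMI-style combination and the choice of two components are all assumptions:

```python
# Sketch: feature engineering (a combined predictor) and attribute extraction (PCA).
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "height_cm": [160, 172, 181, 168, 190],
    "weight_kg": [55, 70, 85, 62, 95],
    "age": [23, 35, 41, 29, 52],
})

# Feature engineering: a combination of predictors motivated by domain knowledge
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Attribute extraction: project the original predictors onto two principal
# components (in practice the predictors would usually be standardised first)
pcs = PCA(n_components=2).fit_transform(df[["height_cm", "weight_kg", "age"]])
df["pc1"], df["pc2"] = pcs[:, 0], pcs[:, 1]

print(df)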
Qualitative variables
Too many levels => problems: processing; bias
A variable with m levels
requires a minimum of m parameters
Data requirements are proportional to the number of parameters
CURSE OF DIMENSIONALITY
Evaluate the reason for the large number of levels
Is a higher level of the hierarchy possible?
GENERALISATION
Can a group of variables with
fewer levels be used instead?
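A minimal sketch of generalisation: mapping a many-level qualitative variable to a higher level of its hierarchy, or pooling rare levels. The city-to-country mapping and the frequency rule are assumptions:

```python
# Sketch: reduce the number of levels of a qualitative variable.
import pandas as pd

df = pd.DataFrame({"city": ["Leeds", "York", "Lyon", "Nice", "Leeds", "Oslo"]})

# Generalisation: replace each city by a higher level of the hierarchy (country)
city_to_country = {"Leeds": "UK", "York": "UK", "Lyon": "France",
                   "Nice": "France", "Oslo": "Norway"}
df["country"] = df["city"].map(city_to_country)

# Alternative: keep frequent levels and pool the rest into "Other"
counts = df["city"].value_counts()
df["city_grouped"] = df["city"].where(df["city"].isin(counts[counts > 1].index), "Other")

print(df)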