Chp 1.1 Process of Data Mining
Why We Preprocess
Raw data is often incomplete and noisy
May contain
Obsolete fields
Missing values
delete records?
Not necessarily best approach
Pattern of missing values may be systematic
Deleting records creates biased subset
Valuable information in other fields lost
alternates
Replace Missing Values with User-defined Constant
e.g. "Missing", 0.0, etc.
Replace Missing Values with Mode or Mean (the median can also be considered if |skewness| > 1)
Replace Missing Values with Random Values (drawn from the variable's underlying distribution) - superior to the mean method, since measures of spread and location remain similar to the original
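A minimal pandas sketch of the three replacement strategies above (the column name income and the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical example: one numeric field with missing values
df = pd.DataFrame({"income": [42000.0, 51000.0, np.nan, 38000.0, np.nan, 60000.0]})

# 1. Replace with a user-defined constant
df["income_const"] = df["income"].fillna(0.0)

# 2. Replace with the mean (mode/median are alternatives for skewed fields)
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3. Replace with random draws from the observed distribution,
#    so the field's location and spread stay close to the original
observed = df["income"].dropna().to_numpy()
rng = np.random.default_rng(seed=0)
mask = df["income"].isna()
df["income_random"] = df["income"]
df.loc[mask, "income_random"] = rng.choice(observed, size=mask.sum())
```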
Outliers
A histogram is used to examine the values of numeric fields and spot outliers
Two-dimensional scatter plots help determine outliers between variable pairs
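A short matplotlib sketch of both checks, using a small hypothetical data frame:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data containing one obvious outlier (age = 120)
df = pd.DataFrame({"age": [23, 35, 41, 29, 120, 52],
                   "income": [31000, 48000, 55000, 40000, 45000, 61000]})

# Histogram of a single numeric field to spot extreme values
plt.hist(df["age"], bins=10)
plt.xlabel("age")
plt.ylabel("count")
plt.show()

# Two-dimensional scatter plot to inspect outliers between a pair of variables
plt.scatter(df["age"], df["income"])
plt.xlabel("age")
plt.ylabel("income")
plt.show()
```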
Data in a form not suitable for data mining
Erroneous values
misclassification
Minimize GIGO (garbage in, garbage out)
Data often from legacy databases
Not looked at in years
Expired
No longer relevant
Missing
Data preparation accounts for roughly 60% of the effort in the data mining process
Therefore, must undergo
data cleaning
data transformation
Variables with greater ranges tend to have larger influence on data model’s results
Therefore, numeric field values should be normalized
types of normalization
Min-max
- works by seeing how much greater the field value is than the minimum value min(X), and scaling this difference by the range: X* = (X - min(X)) / (max(X) - min(X)); see the sketch after this list
Z - score
- works by taking the difference between the field value and the field mean value, and scaling this difference by the standard deviation: X* = (X - mean(X)) / SD(X)
Decimal Scaling
- Xd = X/10^d, where d represents the number of digits in the data value with the largest absolute value
ensures that every normalized value lies between -1 and 1
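A numpy sketch of all three normalization methods on a hypothetical field x:

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 20.0, 50.0])  # hypothetical field values

# Min-max normalization: scale the gap above the minimum by the range
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: scale the gap from the mean by the standard deviation
x_zscore = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^d, where d is the number of digits
# in the data value with the largest absolute value
d = len(str(int(np.abs(x).max())))
x_decimal = x / (10 ** d)
```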
Even with z-score normalization, normalizing does not imply the data are normal; we measure skewness to assess the symmetry of the distribution
We can eliminate skewness by using the following transformation techniques
ln(x)
√x
1/√x
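A sketch comparing skewness before and after each transformation, using scipy.stats.skew on a hypothetical right-skewed positive field:

```python
import numpy as np
from scipy.stats import skew

# Hypothetical right-skewed field (values must be positive for these transforms)
x = np.random.default_rng(1).exponential(scale=3.0, size=1000)

print("original skewness:  ", skew(x))
print("ln(x) skewness:     ", skew(np.log(x)))
print("sqrt(x) skewness:   ", skew(np.sqrt(x)))
print("1/sqrt(x) skewness: ", skew(1 / np.sqrt(x)))
```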
to check for normality
we construct a
normal probability plot
, which plots the quantiles of a particular distribution against the quantiles of the standard normal distribution
if the data are normal, most points fall on a straight line
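A sketch of a normal probability plot using scipy.stats.probplot on hypothetical data:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Hypothetical data to test for normality
x = np.random.default_rng(2).normal(loc=0, scale=1, size=200)

# Normal probability plot: data quantiles vs. standard normal quantiles;
# points close to the straight reference line suggest approximate normality
stats.probplot(x, dist="norm", plot=plt)
plt.show()
```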
Dummy Variables
When a categorical predictor takes k possible values, define k-1 dummy variables and use the unassigned category as the reference category (see the sketch after this block)
Don't transform categorical variables into numerical variables - the exception is categorical variables that are clearly ordered (always, sometimes, never, ...)
Sometimes we have to transform numerical variables (or several categorical variables) into fewer bins / bands
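A pandas sketch of the dummy coding and binning described above (the column names region and age are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "east", "south", "north"],
                   "age": [23, 35, 41, 29, 67]})

# k = 3 categories -> k - 1 = 2 dummy variables; the dropped category
# ("east", first alphabetically) serves as the reference category
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
df = pd.concat([df, dummies], axis=1)

# Binning a numerical variable into a few bands
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                        labels=["young", "middle", "senior"])
```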
Index Field
It is recommended that the data analyst create an index field, which tracks the sort order of the records in the database
Data mining data gets partitioned at least once (and sometimes several times)
It is helpful to have an index field so that the original sort order may be recreated
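A pandas sketch of creating an index field and using it to recreate the original sort order (the field name record_index is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"income": [42000, 51000, 38000, 60000]})

# Create an index field that records the original sort order
df["record_index"] = range(len(df))

# ... after shuffling / partitioning the data ...
df_shuffled = df.sample(frac=1, random_state=0)

# The original order can be recreated from the index field
df_restored = df_shuffled.sort_values("record_index")
```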
Removing variables
When unary
Sometimes, when almost unary
When over 90% values missing
Good idea to make a flag variable first, to check whether the missingness is systematic
When two variables are highly correlated (remove one of them)
PCA might be a good idea
Remove truly duplicate records (consider whether records are duplicated by data entry or by chance - don't remove the latter)
IDs
should be filtered out of the analysis, but not removed from the data
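A pandas sketch covering these removal rules (all column names and the example data are illustrative):

```python
import pandas as pd

# Hypothetical data frame
df = pd.DataFrame({"customer_id": [101, 102, 103, 104],
                   "country":     ["SG", "SG", "SG", "SG"],   # unary field
                   "fax":         [None, None, None, None],   # >90% missing
                   "income":      [42000, 51000, 38000, 60000]})

# Variables that are unary (at most one distinct non-missing value)
unary_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

# Variables with over 90% missing values; create a missingness flag first
# so that systematic patterns can still be checked later
high_missing = [c for c in df.columns if df[c].isna().mean() > 0.9]
for c in high_missing:
    df[c + "_missing_flag"] = df[c].isna().astype(int)

df = df.drop(columns=list(dict.fromkeys(unary_cols + high_missing)))

# Remove truly duplicate records (only those duplicated by data entry)
df = df.drop_duplicates()

# IDs: filter them out of the modelling variables, but keep them in the data
model_vars = [c for c in df.columns if c != "customer_id"]
```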