Data Clean II
Chap 2
May contain
Missing values
If records with missing values are omitted, the missingness might be systematic (biasing what remains), and data is lost
[Misclassification]
For categorical variables, check the frequency of each value; unexpected or rare labels might hint at misclassification
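The frequency check can be sketched as below. This is only an illustration: the `origin` column and its labels are made up, and treating labels that occur once as suspicious is an assumed rule of thumb, not something the notes prescribe.

```python
from collections import Counter

# Hypothetical categorical column; "US" and "USA" probably mean the same thing.
origin = ["USA", "France", "US", "USA", "Europe", "France", "USA"]

# Count how often each label occurs and print the most common first.
freq = Counter(origin)
for label, count in freq.most_common():
    print(f"{label}: {count}")

# Labels that occur only once are candidates for a manual look
# (assumed cutoff, adjust to the size of the data set).
rare = sorted(label for label, count in freq.items() if count == 1)
print("Check these labels:", rare)
```

Whether a rare label is really a misclassification still has to be decided by looking at the records.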
Outliers
Numerical method
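If a value has a z-score outside the range -3 to 3, it is flagged as an outlier. However, the mean and standard deviation used to compute the z-score are themselves sensitive to outliers, so this is not a very reliable method
Don't automatically remove outliers; investigate first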
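A minimal sketch of the z-score rule, assuming the sample standard deviation is used (the notes do not say which one). The function name `zscore_outliers` and the data are hypothetical. The function only reports candidate values; it does not remove them.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose z-score lies outside [-threshold, threshold]."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (an assumption)
    return [x for x in values if abs((x - m) / s) > threshold]

# Hypothetical data: 19 values between 10 and 13 plus one extreme value.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 10, 12, 13, 100]

flagged = zscore_outliers(values)
print(flagged)  # values to investigate, not to delete automatically
```

The extreme value also pulls the mean and standard deviation towards itself, which is why this rule can miss outliers in small samples.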
Data Transformation
Variables with a larger range tend to have a greater influence on the results of some algorithms, so we normalize
How to normalize
Min-Max
- Take the lowest and highest values, then calculate the range
- Subtract the lowest value and divide by the range, so the lowest value = 0 and the highest = 1
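The two steps above can be sketched as follows. The function name `min_max_normalize` and the sample values are made up for illustration, and returning all zeros for a constant column is an assumed convention.

```python
def min_max_normalize(values):
    """Rescale values so the lowest maps to 0 and the highest to 1."""
    lo, hi = min(values), max(values)
    value_range = hi - lo
    if value_range == 0:
        # Constant column: no spread to rescale (assumed convention).
        return [0.0 for _ in values]
    return [(x - lo) / value_range for x in values]

weights = [1613, 2500, 3000, 4997]  # hypothetical values
scaled = min_max_normalize(weights)
print(scaled)
```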
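Z-score standardization
Subtract the mean and divide by the standard deviation
Data values above the mean get a z-score above 0; values below the mean get a z-score below 0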
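A short sketch of z-score standardization, again assuming the sample standard deviation. The name `z_score_standardize` and the data are hypothetical.

```python
from statistics import mean, stdev

def z_score_standardize(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(x - m) / s for x in values]

values = [2, 4, 6, 8, 10]  # hypothetical data, mean is 6
z = z_score_standardize(values)

# Values above the mean come out positive, values below it negative.
for x, score in zip(values, z):
    print(x, round(score, 3))
```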
Decimal Scaling
Divide by 10^d, where d is the number of digits (before the decimal point) in the data value with the largest absolute value, so every scaled value lies between -1 and 1
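As a sketch, decimal scaling could look like this. The name `decimal_scale` is hypothetical, and using d = 0 when every absolute value is below 1 is an assumption for an edge case the notes do not cover.

```python
def decimal_scale(values):
    """Divide by 10**d, d = number of integer digits of the largest absolute value."""
    largest = max(abs(x) for x in values)
    # Count digits before the decimal point; values already below 1 are
    # left unchanged (assumed handling of that edge case).
    d = len(str(int(largest))) if largest >= 1 else 0
    return [x / 10 ** d for x in values]

values = [-455, 12, 986]  # hypothetical data; largest absolute value has 3 digits
scaled = decimal_scale(values)
print(scaled)  # every value now lies between -1 and 1
```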
Binning
- Used when an algorithm requires categorical rather than numeric variables
Equal width binning
Divide the range into k bins of equal width; the analyst decides k
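Equal width binning can be sketched as below. The function name `equal_width_bins` and the data are illustrative, and placing the maximum value in the last bin is an assumed boundary convention.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using k bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values identical: everything goes into the first bin.
        return [0 for _ in values]
    # Clamp so the maximum value lands in the last bin (index k - 1).
    return [min(int((x - lo) / width), k - 1) for x in values]

# Hypothetical data with one large value.
values = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]
bins = equal_width_bins(values, 3)
print(bins)
```

With this data the single large value stretches the range so much that almost everything falls into the first bin and the middle bin stays empty.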
Binning by clustering
The analyst decides k; a clustering algorithm then determines the optimal partition into k bins
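The notes do not name a clustering algorithm. The sketch below assumes a simple one-dimensional k-means, written in plain Python. The function names, the way the starting centers are chosen and the data are all hypothetical.

```python
def nearest(x, centers):
    """Index of the center closest to x (lowest index wins ties)."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))

def kmeans_1d_bins(values, k, iterations=100):
    """Bin values with a basic 1-D k-means; returns (labels, centers)."""
    distinct = sorted(set(values))
    n = len(distinct)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of distinct values")
    # Start from k centers spread evenly over the sorted distinct values.
    denom = max(k - 1, 1)
    centers = [float(distinct[i * (n - 1) // denom]) for i in range(k)]
    for _ in range(iterations):
        labels = [nearest(x, centers) for x in values]
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(values, labels) if lab == j]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    labels = [nearest(x, centers) for x in values]  # final assignment
    return labels, centers

values = [1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44]  # hypothetical data
labels, centers = kmeans_1d_bins(values, 3)
print(labels)
print(centers)
```

On this data the small values, the values around 11 to 12 and the single large value each end up in their own bin.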
Equal width binning is greatly affected by outliers, and equal frequency binning assumes each bin is equally likely, so the other two methods are preferable
Index Field
For example, using IBM/SPSS Modeler, you can use the Index function in the Derive node to create an index field.