Chp 1.1 Process of Data Mining
Why We Preprocess
Raw data is often incomplete and noisy
May contain
Obsolete fields
Missing values
delete records?
Not necessarily best approach
Pattern of missing values may be systematic
Deleting records creates biased subset
Valuable information in other fields lost
alternates
Replace Missing Values with User-defined Constant
e.g. "Missing", 0.0, etc.
Replace Missing Values with Mode or Mean (the median can also be considered if |skewness| > 1)
Replace Missing Values with Random Values (drawn from the variable's underlying distribution) - superior to the mean method, since measures of spread and location remain similar to the original
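A minimal pandas sketch of the three replacement strategies above (the column name income and the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical example: one numeric field with missing values
df = pd.DataFrame({"income": [42000.0, 51000.0, np.nan, 38000.0, np.nan, 60000.0]})

# 1. Replace with a user-defined constant
df["income_const"] = df["income"].fillna(0.0)

# 2. Replace with the mean (mode/median are alternatives for skewed fields)
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3. Replace with random draws from the observed distribution,
#    so the field's location and spread stay close to the original
observed = df["income"].dropna().to_numpy()
rng = np.random.default_rng(seed=0)
mask = df["income"].isna()
df["income_random"] = df["income"]
df.loc[mask, "income_random"] = rng.choice(observed, size=mask.sum())
```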
Outliers
A histogram is used to examine the values of numeric fields and spot outliers
Two-dimensional scatter plots help determine outliers between variable pairs
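A short matplotlib sketch of both checks, using a small hypothetical data frame:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data containing one obvious outlier (age = 120)
df = pd.DataFrame({"age": [23, 35, 41, 29, 120, 52],
                   "income": [31000, 48000, 55000, 40000, 45000, 61000]})

# Histogram of a single numeric field to spot extreme values
plt.hist(df["age"], bins=10)
plt.xlabel("age")
plt.ylabel("count")
plt.show()

# Two-dimensional scatter plot to inspect outliers between a pair of variables
plt.scatter(df["age"], df["income"])
plt.xlabel("age")
plt.ylabel("income")
plt.show()
```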
Data in a form not suitable for data mining
Erroneous values
misclassification
Minimize GIGO (garbage in, garbage out)
Data often from legacy databases
Not looked at in years
Expired
No longer relevant
Missing
Data preparation accounts for roughly 60% of the effort in the data mining process
Therefore, must undergo
data cleaning
data transformation
Variables with greater ranges tend to have larger influence on data model’s results
Therefore, numeric field values should be normalized
types of normalization
Min-max
- works by seeing how much greater the field value is than the minimum value min(X), and scaling this difference by the range: X* = (X - min(X)) / (max(X) - min(X)); see the sketch after this list
Z - score
- works by taking the difference between the field value and the field mean value, and scaling this difference by the standard deviation: X* = (X - mean(X)) / SD(X)
Decimal Scaling
- Xd = X/10^d, where d represents the number of digits in the data value with the largest absolute value
ensures that every normalized value lies between -1 and 1
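A numpy sketch of all three normalization methods on a hypothetical field x:

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 20.0, 50.0])  # hypothetical field values

# Min-max normalization: scale the gap above the minimum by the range
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: scale the gap from the mean by the standard deviation
x_zscore = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^d, where d is the number of digits
# in the data value with the largest absolute value
d = len(str(int(np.abs(x).max())))
x_decimal = x / (10 ** d)
```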
Even with z-score normalization, normalizing does not imply the data are normal; we measure skewness to assess the symmetry of the distribution
We can eliminate skewness by using the following transformation techniques
ln(x)
√x
1/√x
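A sketch comparing skewness before and after each transformation, using scipy.stats.skew on a hypothetical right-skewed positive field:

```python
import numpy as np
from scipy.stats import skew

# Hypothetical right-skewed field (values must be positive for these transforms)
x = np.random.default_rng(1).exponential(scale=3.0, size=1000)

print("original skewness:  ", skew(x))
print("ln(x) skewness:     ", skew(np.log(x)))
print("sqrt(x) skewness:   ", skew(np.sqrt(x)))
print("1/sqrt(x) skewness: ", skew(1 / np.sqrt(x)))
```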
to check for normality
we construct a
normal probability plot
, which plots the quantiles of a particular distribution against the quantiles of the standard normal distribution
if the data are normal, most points fall on a straight line
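A sketch of a normal probability plot using scipy.stats.probplot on hypothetical data:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Hypothetical data to test for normality
x = np.random.default_rng(2).normal(loc=0, scale=1, size=200)

# Normal probability plot: data quantiles vs. standard normal quantiles;
# points close to the straight reference line suggest approximate normality
stats.probplot(x, dist="norm", plot=plt)
plt.show()
```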
Dummy Variables
When a categorical predictor takes k possible values, define k-1 dummy variables and use the unassigned category as the reference category (see the sketch after this block)
Don't transform categorical variables into numerical variables - the exception is categorical variables that are clearly ordered (always, sometimes, never, ...)
Sometimes we have to transform numerical variables (or several categorical variables) into fewer bins / bands
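A pandas sketch of the dummy coding and binning described above (the column names region and age are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "east", "south", "north"],
                   "age": [23, 35, 41, 29, 67]})

# k = 3 categories -> k - 1 = 2 dummy variables; the dropped category
# ("east", first alphabetically) serves as the reference category
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
df = pd.concat([df, dummies], axis=1)

# Binning a numerical variable into a few bands
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                        labels=["young", "middle", "senior"])
```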
Index Field
It is recommended that the data analyst create an index field, which tracks the sort order of the records in the database
Data mining data gets partitioned at least once (and sometimes several times)
It is helpful to have an index field so that the original sort order may be recreated
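A pandas sketch of creating an index field and using it to recreate the original sort order (the field name record_index is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"income": [42000, 51000, 38000, 60000]})

# Create an index field that records the original sort order
df["record_index"] = range(len(df))

# ... after shuffling / partitioning the data ...
df_shuffled = df.sample(frac=1, random_state=0)

# The original order can be recreated from the index field
df_restored = df_shuffled.sort_values("record_index")
```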
Removing variables
When unary
Sometimes, when almost unary
When over 90% values missing
Good idea to make a flag variable first, to check whether the missingness is systematic
When two variables are highly correlated (remove one of them)
PCA might be a good idea
Remove truly duplicate records (consider whether records are duplicated by data entry or by chance - don't remove the latter)
IDs
should be filtered out of the analysis, but not removed from the data
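A pandas sketch covering these removal rules (all column names and the example data are illustrative):

```python
import pandas as pd

# Hypothetical data frame
df = pd.DataFrame({"customer_id": [101, 102, 103, 104],
                   "country":     ["SG", "SG", "SG", "SG"],   # unary field
                   "fax":         [None, None, None, None],   # >90% missing
                   "income":      [42000, 51000, 38000, 60000]})

# Variables that are unary (at most one distinct non-missing value)
unary_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

# Variables with over 90% missing values; create a missingness flag first
# so that systematic patterns can still be checked later
high_missing = [c for c in df.columns if df[c].isna().mean() > 0.9]
for c in high_missing:
    df[c + "_missing_flag"] = df[c].isna().astype(int)

df = df.drop(columns=list(dict.fromkeys(unary_cols + high_missing)))

# Remove truly duplicate records (only those duplicated by data entry)
df = df.drop_duplicates()

# IDs: filter them out of the modelling variables, but keep them in the data
model_vars = [c for c in df.columns if c != "customer_id"]
```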