Data Clean II Chap 2 (May contain (Outliers (Numerical method (If has z…

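- - - - With a constant specified by the analyst
      - field mean (resulting confidence levels may be overoptimistic, because the spread is artificially reduced)
      - mode, for categorical variables
      - random value drawn from the observed distribution (spread stays about the same)
      - imputed value based on the record's other characteristics (best option)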
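A minimal sketch of the first four replacement options above, assuming missing entries are represented as `None`. The function name `impute`, its parameters, and the sample data are illustrative only. Imputation based on the record's other characteristics needs a predictive model and is not shown.

```python
import random
import statistics

def impute(values, strategy="mean", constant=0, seed=0):
    """Return a copy of `values` with None entries replaced.

    strategy: "constant", "mean", "mode" or "random" (hypothetical names).
    """
    observed = [v for v in values if v is not None]
    if strategy == "constant":
        fill = lambda: constant
    elif strategy == "mean":
        centre = statistics.mean(observed)
        fill = lambda: centre
    elif strategy == "mode":
        most_common = statistics.mode(observed)
        fill = lambda: most_common
    elif strategy == "random":
        # Draw from the observed values so the spread is roughly preserved.
        rng = random.Random(seed)
        fill = lambda: rng.choice(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill() if v is None else v for v in values]

ages = [20, 30, None, 40, None]          # numeric field with gaps
colours = ["red", "blue", "red", None]   # categorical field with a gap

print(impute(ages, "mean"))
print(impute(colours, "mode"))
print(impute(ages, "random"))
```

Mean replacement pulls every gap to the same central value, which is why it understates the spread. The random strategy avoids that, but can produce records whose filled-in values do not fit together.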
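  - - - Z-score: a value with a z-score above 3 or below -3 is an outlier; but the mean and standard deviation used to compute it are themselves sensitive to outliers, so not a very good method
      - IQR: a value more than 1.5 × IQR above Q3 or below Q1 is an outlier; a more robust method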
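The two numerical rules above can be compared on a small sample. This is a sketch under assumptions: the function names and data are made up, the z-score uses the sample standard deviation, and quartiles come from `statistics.quantiles` with its default method, so other quartile conventions may give slightly different fences.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Values whose z-score lies beyond +/- threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs((v - mean) / sd) > threshold]

def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 100]

print(zscore_outliers(data))
print(iqr_outliers(data))
```

On this sample the z-score rule flags nothing: the value 100 inflates both the mean and the standard deviation, so its own z-score stays just under 3. The IQR rule flags 100, because quartiles barely move when one extreme value is added.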
  - - - Affected by outliers
      - should be around middle if data not skewed
    - - less affected by outliers
    - - can even use with categorical variables
      - but not necessarily at center
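A short illustration of the points above, on the assumption that they describe the mean, the median and the mode in that order (the branch labels are missing from this export). The sample values and variable names are invented.

```python
import statistics

incomes = [30, 32, 35, 31, 33]
with_outlier = incomes + [500]  # add one extreme value

# How far each measure of centre moves when the extreme value is added.
mean_shift = statistics.mean(with_outlier) - statistics.mean(incomes)
median_shift = statistics.median(with_outlier) - statistics.median(incomes)

# The mode also works for categorical data, where mean and median are undefined.
most_common_colour = statistics.mode(["red", "blue", "red", "green"])

print(mean_shift, median_shift, most_common_colour)
```

The mean moves by tens of units while the median moves by half a unit.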
    - - mean, median & mode can be identical for data sets whose shapes differ
      - Can use range (max - min)
      - Standard Deviation (most used)
      - IQR
      - mean absolute deviation (not as sensitive as SD when extreme values involved)
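A sketch of the four spread measures listed above, applied to two invented data sets that share the same mean, median and mode. The helper name `spread_measures` is hypothetical. The mean absolute deviation here is the average absolute distance from the mean.

```python
import statistics

def spread_measures(values):
    """Range, sample SD, IQR and mean absolute deviation of `values`."""
    mean = statistics.mean(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
        "mad": sum(abs(v - mean) for v in values) / len(values),
    }

a = [4, 5, 5, 5, 6]  # tightly clustered
b = [1, 5, 5, 5, 9]  # same centre, wider spread

for name, values in (("a", a), ("b", b)):
    print(name, statistics.mean(values), statistics.median(values),
          statistics.mode(values), spread_measures(values))
```

Both sets have mean, median and mode equal to 5, but the range, SD and mean absolute deviation are larger for `b`. With only five points and three identical middle values, the IQR may separate the two sets less clearly than the other measures do.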
  - - - Min-Max
        
        Take the lowest and highest values and compute the range (max - min)
        
        X* = (X - min) / range, so the lowest value becomes 0 and the highest becomes 1
        
        Normalized values range from 0 to 1
      - Z-score standardization
        
        mean becomes 0
        
        SD = 1
        
        values above the mean become positive, values below the mean become negative
      - which to use depends on data
      - Decimal Scaling
        
        divide by 10^d, where d is the number of digits before the decimal point in the value with the largest absolute value
        
        normalized values lie between -1 and 1
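The three transformations above can be sketched as follows. Function names and sample values are illustrative. The sketch assumes the field is not constant (so the range and SD are non-zero), uses the sample standard deviation, and takes d as the digit count of the integer part of the largest absolute value.

```python
import statistics

def min_max(values):
    """Rescale so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def z_score(values):
    """Standardize to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def decimal_scaling(values):
    """Divide by 10**d, d = digits in the largest absolute value."""
    largest = max(abs(v) for v in values)
    d = len(str(int(largest)))  # digits before the decimal point
    return [v / 10 ** d for v in values]

weights = [1613, 2500, 3200, 4997]

print(min_max(weights))
print(z_score(weights))
print(decimal_scaling(weights))
```

For `weights`, the largest absolute value has four digits, so decimal scaling divides by 10^4 and every result falls strictly between -1 and 1.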