Week 11 - Data Cleaning/Mining (Data cleaning (general tips (only keep…

- - - - integer numbers
      - eg # employees,
        no. words in a text
    - - continuous numeric data
        real numbers (floating points)
      - eg income, revenue,
        temp, height
  - - - like nominal, but only 2 values
      - eg, answers to true/false questions,
        result of med test (pos/neg)
        gender (mostly)
    - - values have meaningful order
        can be ranked
      - but magnitude between
        successive values is unknown
      - eg size = sm, med, lrg
        grades, army ranks
    - - states, names of things
      - eg hair colour chosen in range,
        marital status, occupation, post codes
- - - - clustering groups data
        into similar sets
      - used to detect outliers: data
        that does not belong to
        any cluster, which is then removed
      - statistical analysis
        eg z-scores
    - - data smoothed by
        fitting into
        regression function
      - eg, smooth values measured
        with an uncalibrated device,
        using related data from other,
        well-calibrated sensors
    - - sort data
      - partition into
        equal-frequency bins
      - represent each bin with its mean,
        median or boundaries (min/max)
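
Two of the noise-handling ideas above, z-score outlier removal and equal-frequency binning, can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names, the z-score threshold and the sample values are assumptions made for the example and are not part of the original notes. Smoothing by bin means is shown; median or boundary smoothing would follow the same pattern.

```python
from statistics import mean, pstdev


def remove_outliers_zscore(values, threshold=3.0):
    """Keep only values whose |z-score| is within the threshold.

    The threshold is an assumed parameter; 2 or 3 are common choices.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return list(values)
    return [v for v in values if abs((v - mu) / sigma) <= threshold]


def smooth_by_bin_means(values, n_bins):
    """Sort, split into equal-frequency bins, replace each value by its bin mean."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    smoothed = []
    start = 0
    for i in range(n_bins):
        # The first `extra` bins take one more value when the split is uneven.
        end = start + size + (1 if i < extra else 0)
        bin_values = ordered[start:end]
        if bin_values:
            smoothed.extend([mean(bin_values)] * len(bin_values))
        start = end
    return smoothed


# Hypothetical sample data, for illustration only.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
cleaned = remove_outliers_zscore(readings, threshold=2.0)

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
binned = smooth_by_bin_means(prices, n_bins=3)
```

With these sample values, `cleaned` no longer contains the reading of 100, and `binned` replaces the nine prices with the means of three bins of three values each (9, 22 and 29). Note that with only ten readings a single extreme value cannot exceed a z-score of 3, which is why the example uses a threshold of 2.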
- - - - binary:
        assign 1 of 2
        classes/labels
      - multi-class:
        more than 2 target
        classes/labels