Data quality & Data preprocessing
2.1 Definition of quality &
criteria for its evaluation
2.1.1. Quality. ISO 9000:2015.
Degree to which a set of inherent characteristics fulfils requirements
Requirements: the requirements of the user
Set of inherent characteristics:
criteria concerning different characteristics
2.1.2. Criteria for evaluating
data quality
Relevance
Do the data address current/potential needs;
inform issues of importance to the data users
Accuracy
Represent phenomena they
were designed to measure
Reliability
Quality over time,
repeatability of the data
Timeliness
(currency)
Difference between the time when data are
collected and when they become available
Punctuality
Release date vs target date
Coherence
("consistency")
The extent to which data can be used
with other data and over time. Irrelevant details; confusing measures; ambiguous formats. Beyond numerical consistency
Compatibility
Whether compatible in format and
definition with other data
Completeness
No missing values
Accessibility
Ease of accessing the data
Interpretability
Availability of information for the interpretation
of the data and their uses
Security
Physically and logically secure?
Trust
Believability, reliability, reputation,
if from authoritative source
Usefulness
Advantages of using the data
Considerations:
Data quality:
Doesn't need to be perfect
Quality trade-offs
2.2. Data Preprocessing
2.2.1 Data sources & tasks
in data preprocessing
Analytics takes a simplistic view of data
as a table with well-defined columns
In real life, data rarely look like a single table;
they come from different sources in different formats and are always dirty
Data preprocessing: bring together
all sources, extract useful features
=> biggest challenge in analytics
Data Sources
Relational data warehouse
("star schema format")
Multi-dimensional
data stores (cubes)
NoSQL databases
Why data preprocessing?
Real-world data
are always dirty
Need to be integrated from diff sources
Have missing values
Contain errors &
inconsistent values
Contain outliers
Do not have the right level
of aggregation
Important step for
successful analytics
Quality decisions & analytics
need quality data
Tasks in data preprocessing
Data integration
Integr. of multiple databases, cubes or files
Data cleaning
Impute missing values, remove noise data,
remove outliers, resolve inconsistencies
Data transformation
Aggregation, normalisation,
"feature engineering"
Data reduction
Reduce the volume while
producing the same (or similar) analytical results
Data discretisation
Reduction for quantitative data
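A minimal sketch of discretisation (binning a quantitative variable into a few levels) with pandas; the DataFrame and the "income" column are illustrative assumptions, not from the source:

```python
# Sketch: discretise a quantitative variable into qualitative bins.
import pandas as pd

df = pd.DataFrame({"income": [12_000, 25_000, 40_000, 70_000, 150_000]})

# Equal-width binning: fixed-width intervals mapped to labels
df["income_band"] = pd.cut(df["income"], bins=3, labels=["low", "medium", "high"])

# Equal-frequency binning: roughly the same number of records per bin
df["income_quartile"] = pd.qcut(df["income"], q=4, labels=False)

print(df)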
2.2.2. Data Integration
Inconsistencies between
data sources
Diff. definitions & classifications
Diff. in timing
Error in one source
Diff. spelling
Diff. abbreviations
Default values, different data types
Data linkage & matching
Exact matching
(deterministic record linkage).
If there is a common identifier
Can match or NOT
Depends on the quality of the variables
May not be sufficient alone
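A minimal sketch of exact matching, assuming two pandas DataFrames that share a hypothetical "person_id" identifier (names and data are made up):

```python
# Sketch: deterministic record linkage on a shared identifier.
import pandas as pd

survey = pd.DataFrame({"person_id": [1, 2, 3], "income": [30_000, 45_000, 28_000]})
register = pd.DataFrame({"person_id": [2, 3, 4], "region": ["North", "South", "East"]})

# Each pair of records either matches on the identifier or it does not;
# rows without a counterpart are dropped by the inner join.
linked = survey.merge(register, on="person_id", how="inner")
print(linked)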
Probabilistic matching
Use matching keys -
common variables:
address, occupation, etc
Rely on probabilities to
determine which records match
Classifying into: matches,
non-matches, possible matches
Distinguishing Power (DP) of variable:
uniqueness of the values
High DP (reference
number, full name,...)
Low DP (sex, age, nationality)
Can depend on the level of detail
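One simple proxy for the distinguishing power of candidate matching variables is the share of unique values in each column; a rough sketch with made-up data:

```python
# Sketch: proportion of unique values as a proxy for distinguishing power (DP).
import pandas as pd

df = pd.DataFrame({
    "full_name": ["A. Smith", "B. Jones", "C. Lee", "D. Khan"],
    "sex": ["F", "M", "F", "M"],
    "nationality": ["UK", "UK", "UK", "FR"],
})

dp = df.nunique() / len(df)   # close to 1 => high DP, close to 0 => low DP
print(dp.sort_values(ascending=False))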
Intro
Problems:
Different attribute names
Different units
Different scales
Derived attributes
Inconsistency due
to redundancy
Data integration combines data
from multiple sources into
a common format for analytics
Matching
techniques
Clerical matching: with human intervention
Automatic matching: minimal human intervention
Aim: maximise automatic matching, minimise clerical matching
Record linkage process
Record pair comparison
Similarity vector classification: "matches",
"non-matches", "possible matches"
Blocking / indexing: split records into blocks,
compare only records in the same block
Clerical review for possible matches
Cleaning and standardisation
Evaluation: the complexity, completeness and
quality of the linked records are evaluated
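A rough sketch of these steps (blocking, record pair comparison, classification by similarity, flagging possible matches for clerical review); the data, the string-similarity measure and the thresholds are all assumptions:

```python
# Sketch: block on postcode, compare names, classify pairs.
from difflib import SequenceMatcher

import pandas as pd

a = pd.DataFrame({"name": ["John Smith", "Mary Brown"], "postcode": ["AB1", "CD2"]})
b = pd.DataFrame({"name": ["Jon Smith", "Marie Brown", "Tom Ford"],
                  "postcode": ["AB1", "CD2", "AB1"]})

def similarity(x, y):
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, x, y).ratio()

pairs = []
for _, ra in a.iterrows():
    # Blocking / indexing: only compare records in the same postcode block
    for _, rb in b[b["postcode"] == ra["postcode"]].iterrows():
        s = similarity(ra["name"], rb["name"])
        if s >= 0.9:
            label = "match"
        elif s >= 0.7:
            label = "possible match"   # would go to clerical review
        else:
            label = "non-match"
        pairs.append((ra["name"], rb["name"], round(s, 2), label))

print(pd.DataFrame(pairs, columns=["record_a", "record_b", "similarity", "class"]))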
2.3. Quality measurement
High-quality data:
"fit for use" in their intended role;
correctly represent the real-world
construct to which they refer
Quality of internal data ("consistency")
becomes more important
Data preprocessing (cleaning) may be
required to ensure quality
2.2. Data Preprocessing
2.2.3 Data Cleaning
Missing values
DISGUISED MISSING VALUES:
unknown or inapplicable
values encoded as valid data
(0 instead of n/a)
To detect:
Missing value plot
Pareto chart of percent
missing for each variable
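A minimal sketch of both ideas, recoding a disguised zero and ranking variables by percent missing (the basis of a Pareto chart); the data and the "0 means unknown" rule are assumptions:

```python
# Sketch: recode a disguised missing value, then compute percent missing per variable.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 51, np.nan],
    "income": [30_000, 0, 45_000, 28_000],      # 0 was used where income is unknown
    "region": ["North", "South", None, "East"],
})

# Disguised missing value: turn the sentinel 0 into a true missing value
df["income"] = df["income"].replace(0, np.nan)

# Percent missing per variable, sorted descending
pct_missing = df.isna().mean().sort_values(ascending=False) * 100
print(pct_missing)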
Distorted values
Outliers
2.2.4 Data transformation
Techniques
Aggregation
Summarisation
Cube construction
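A small sketch of aggregation / cube-style summarisation with a pandas pivot table; the sales data and column names are made up:

```python
# Sketch: summarise detailed records into a region x year table of totals.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "year": [2023, 2024, 2023, 2024],
    "amount": [100, 120, 80, 95],
})

cube = sales.pivot_table(index="region", columns="year",
                         values="amount", aggfunc="sum", margins=True)
print(cube)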
Normalization
Standardisation:
Z-score transformation
Decimal scaling: shift the decimal
point of all values
Min-Max scaling: to 0-1 range
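A short sketch of the three scalings above on one numeric vector (the values are made up):

```python
# Sketch: z-score standardisation, decimal scaling and min-max scaling.
import numpy as np

x = np.array([200.0, 300.0, 400.0, 1000.0])

# Z-score: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10**j, with j chosen so every |value| falls below 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / 10**j

# Min-max: rescale to the 0-1 range
min_max = (x - x.min()) / (x.max() - x.min())

print(z, decimal_scaled, min_max, sep="\n")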
Attribute/feature
construction/extraction
New attribute extraction: principal
component analysis, clustering
Feature engineering:
development of "better" features
Often most effective:
comes from understanding the problem
A combination of predictors may be more effective
than the individual values
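A brief sketch of both ideas: a domain-motivated combination of predictors and new attributes extracted with PCA. The feature names, the BMI-style combination and the choice of two components are all assumptions:

```python
# Sketch: feature engineering (a combined predictor) and attribute extraction (PCA).
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "height_cm": [160, 172, 181, 168, 190],
    "weight_kg": [55, 70, 85, 62, 95],
    "age": [23, 35, 41, 29, 52],
})

# Feature engineering: a combination of predictors motivated by domain knowledge
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Attribute extraction: project the original predictors onto two principal
# components (in practice the predictors would usually be standardised first)
pcs = PCA(n_components=2).fit_transform(df[["height_cm", "weight_kg", "age"]])
df["pc1"], df["pc2"] = pcs[:, 0], pcs[:, 1]

print(df)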
Qualitative variables
Too many levels => problems: processing; bias
A variable with m levels
requires a minimum of m parameters
Data requirements are proportional to the number of parameters
CURSE OF DIMENSIONALITY
Evaluate the reason for the large number of levels
Is a higher level of the hierarchy possible?
GENERALISATION
Can a group of variables with
fewer levels be used instead?
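A minimal sketch of generalisation: mapping a many-level qualitative variable to a higher level of its hierarchy, or pooling rare levels. The city-to-country mapping and the frequency rule are assumptions:

```python
# Sketch: reduce the number of levels of a qualitative variable.
import pandas as pd

df = pd.DataFrame({"city": ["Leeds", "York", "Lyon", "Nice", "Leeds", "Oslo"]})

# Generalisation: replace each city by a higher level of the hierarchy (country)
city_to_country = {"Leeds": "UK", "York": "UK", "Lyon": "France",
                   "Nice": "France", "Oslo": "Norway"}
df["country"] = df["city"].map(city_to_country)

# Alternative: keep frequent levels and pool the rest into "Other"
counts = df["city"].value_counts()
df["city_grouped"] = df["city"].where(df["city"].isin(counts[counts > 1].index), "Other")

print(df)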