Data Exploration (ref. lectures of Marco Brambilla 2018)
Find out whether a relationship between variables is significant at a pre-defined significance level.
In correlation, the aim is to draw a line through the data such that the deviations of the points from the line are minimized.
Because deviations can be negative or positive, each is first squared, then the squared deviations are added together, and the square root is taken.
Linear correlation +1: perfect positive relationship
Linear correlation −1: perfect negative relationship
Linear correlation 0: no linear relationship
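These two ideas, the correlation coefficient and the least-squares line, can be sketched as follows (a minimal example assuming Python with NumPy, neither of which the lectures prescribe; the data values are hypothetical):

```python
import numpy as np

# Hypothetical data with a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Pearson linear correlation: +1 perfect positive, 0 no linear relationship
r = np.corrcoef(x, y)[0, 1]

# Least-squares line: minimizes the sum of squared deviations from the line
slope, intercept = np.polyfit(x, y, 1)

print(f"r = {r:.3f}, line: y = {slope:.2f}x + {intercept:.2f}")
```

Because the example data are nearly on a line, r comes out close to +1.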
Plot the same variable against different «facets» (categories or groups),
also with trend lines,
and confidence bands (with smoothing).
Bi-variate analysis can be performed for any combination of variable types:
Categorical & Continuous
Draw plots (e.g. box plots) of the continuous variable for each level of the categorical variable.
If the levels are small in number, this will not show the statistical significance.
To look at the statistical significance we can perform a Z-test, t-test, or ANOVA.
Continuous & Continuous
Possible patterns: no correlation, strong (positive/negative), moderate (positive/negative), or a curvilinear relationship.
SAMPLES AND POPULATIONS
How to compare two samples?
In essence, the t-test gives a measure of the difference between the sample means
in relation to the spread (variability) of the samples.
Use the t-test only for comparing 2 samples.
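A minimal sketch of a two-sample t-test (assuming Python with SciPy, which the lectures do not prescribe; the sample values are hypothetical):

```python
from scipy import stats

# Two hypothetical samples, e.g. the same measurement in two groups
sample_a = [10.1, 9.8, 10.5, 10.2, 9.9, 10.4, 10.0, 9.7]
sample_b = [12.0, 11.8, 12.3, 11.9, 12.1, 12.4, 11.7, 12.2]

# t = difference between the sample means relative to the samples' spread
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value leads to rejecting the null hypothesis of equal means.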
COMPARISONS BETWEEN THREE OR MORE SAMPLES
• F-test and ANOVA
ANOVA and Bartlett’s Test
Essentially, ANOVA involves dividing the variance in the results into:
• between-groups variance
• within-groups variance
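A minimal sketch of a one-way ANOVA together with Bartlett's test (assuming Python with SciPy; the group values are hypothetical):

```python
from scipy import stats

# Three hypothetical groups, e.g. a measurement under three conditions
g1 = [5.1, 4.9, 5.3, 5.0, 5.2]
g2 = [5.6, 5.8, 5.5, 5.9, 5.7]
g3 = [6.3, 6.1, 6.4, 6.2, 6.5]

# One-way ANOVA: F = between-groups variance / within-groups variance
f_stat, p_anova = stats.f_oneway(g1, g2, g3)

# Bartlett's test checks ANOVA's equal-variance assumption
b_stat, p_bartlett = stats.bartlett(g1, g2, g3)

print(f"ANOVA: F = {f_stat:.1f}, p = {p_anova:.2e}")
print(f"Bartlett: p = {p_bartlett:.3f}")
```

Here the group means differ while the within-group variances are similar, so ANOVA rejects equal means and Bartlett's test does not reject equal variances.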
When to use the t-distribution?
• when the sample size is small, and/or
• when the population variance is unknown
Degrees of freedom = n − 1, with n = size of the sample
Null hypothesis (to be tested): typically, that there is no difference between the means.
COMPARING TWO SAMPLES
Type I error
The α level represents the probability of finding a significant difference between the two means when none exists
Type II error
The β level represents the probability of not finding a significant difference between the two means when one exists
Categorical & Categorical
Two-way table: count and count%.
Stacked column chart: visual form of the two-way table.
Chi-square test: used to derive the statistical significance of the relationship between the variables.
A formal statistical test to determine
whether results are statistically significant
Also, it tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population.
It is based on the difference between the expected and observed frequencies in one or more categories of the two-way table.
It returns the probability for the computed chi-square statistic given the degrees of freedom.
The statistic is computed as χ² = Σ (Observed − Expected)² / Expected.
The larger the chi-square value, the greater the likelihood of a significant difference between the variables.
To know for sure, you need to look up the p-value in a chi-square table.
Analysis of Categorical Data
Probability of 0: indicates that both categorical variables are dependent.
Probability of 1: shows that both variables are independent.
Probability less than 0.05: indicates that the relationship between the variables is significant at 95% confidence.
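A minimal sketch of a chi-square test on a two-way table (assuming Python with SciPy; the counts are hypothetical):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical two-way table of counts (e.g. gender × preference)
table = np.array([[30, 10],
                  [20, 40]])

# Compares observed counts with the expected counts under independence
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
# p < 0.05 -> relationship significant at 95% confidence
```

The function also returns the expected frequencies, so the (Observed − Expected) differences driving the statistic can be inspected directly.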
4- Missing values treatment
Missing data in the training data set:
• can reduce the power of a model
• can lead to a biased model, because we have not analysed the behavior and relationship with other variables
• can lead to wrong predictions or classifications
Missing Values Treatment
Mean / mode / median imputation
Analyse each variable separately:
• measures of dispersion
• skewness and kurtosis
• outliers & missing values
• frequencies of categories
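Mean and mode imputation can be sketched as follows (assuming Python with pandas, which the lectures do not prescribe; the data set is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data set with missing values
df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0, np.nan],
    "gender": ["F", "M", None, "M", "M"],
})

# Continuous variable: impute with the mean (or median for skewed data)
df["height"] = df["height"].fillna(df["height"].mean())

# Categorical variable: impute with the mode (most frequent category)
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

print(df)
```

Each variable is treated separately, matching the per-variable analysis above.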
5- Outlier treatment
Outlier = an observation that appears far away and diverges from an overall pattern in a sample.
Beware of univariate vs. multivariate
Detection: box plot, histogram, scatter plot.
Common rules flag values:
• outside the range Q1 − 1.5 × IQR to Q3 + 1.5 × IQR
• out of the range of the 5th and 95th percentiles
• three or more standard deviations away from the mean
Multivariate outliers are measured using an index of influence, leverage, or distance.
Mahalanobis’ distance and Cook’s D
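The IQR rule for univariate outlier detection can be sketched as follows (assuming Python with NumPy; the data values are hypothetical):

```python
import numpy as np

# Hypothetical sample; 42 diverges from the overall pattern
data = np.array([12, 13, 12, 14, 13, 15, 12, 14, 13, 42])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)
```

This is exactly the whisker rule a box plot draws, which is why box plots are listed above as a detection tool.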
Deleting observations: when they are due to data entry or data processing errors, or when the outlier observations are very small in number. We can also use trimming at both ends.
Transforming and binning values: log functions or discretization
Imputation & Separate treatment
Variable identification:
• Data type: e.g. character, numeric
• Dependent (output) variables: the thing that you want to measure
• Independent (input) variables: e.g. gender, height, weight
• Category of the variable: categorical (e.g. gender) or continuous
6- Feature Engineering
Change of Scale, Linearization, Normalization, Binning
Variable / Feature creation
Crucial for the quality of machine learning models
• Derived variables
• Dummy variables (binarization)
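Binning and dummy variables (binarization) can be sketched as follows (assuming Python with pandas; the data are hypothetical):

```python
import pandas as pd

# Hypothetical data set with a categorical and a continuous variable
df = pd.DataFrame({"gender": ["F", "M", "M", "F"],
                   "age": [22, 35, 58, 41]})

# Dummy variables (binarization): one 0/1 column per category
dummies = pd.get_dummies(df["gender"], prefix="gender")

# Binning: discretize a continuous variable into ranges
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                       labels=["young", "middle", "senior"])

print(pd.concat([df, dummies], axis=1))
```

The bin edges and labels here are illustrative; in practice they would be derived from the data distribution or domain knowledge.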