Statistics - Coggle Diagram

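- - - - if H0 is rejected (Ha is accepted) and the normality check is also met: use Tukey's HSD test to find which pairs of group means differ
        
        TukeyHSD(fit, conf.level = ??)   (fit is a fitted aov model; conf.level is a proportion such as 0.95, not a percentage)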
    - - MSB (mean square between): SSB (sum of squares between)/ dfB (degrees of freedom between)
      - MSW (mean square within): SSW (sum of squares within)/ dfW (degrees of freedom within)
      - if the F-statistic (MSB/ MSW) is large & the p-value is small (below the pre-defined significance level) -> at least one group mean differs significantly from the others
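The ANOVA quantities above can be sketched in R as follows. The data are hypothetical (three made-up groups of five values), and `scores`, `group`, and the intermediate variable names are illustrative assumptions rather than anything from the notes. The sketch computes F = MSB / MSW by hand, compares it with `aov()`, and then runs `TukeyHSD()` at an assumed 95% confidence level.

```r
# Hypothetical data: three groups with five observations each
scores <- c(23, 25, 21, 24, 22,
            30, 31, 29, 32, 28,
            24, 26, 25, 23, 27)
group <- factor(rep(c("A", "B", "C"), each = 5))

# One-way ANOVA
fit <- aov(scores ~ group)
print(summary(fit))

# Manual calculation of the same F statistic
grand_mean <- mean(scores)
group_mean_per_obs <- ave(scores, group)   # each observation's own group mean

ssb <- sum((group_mean_per_obs - grand_mean)^2)   # sum of squares between
ssw <- sum((scores - group_mean_per_obs)^2)       # sum of squares within
dfb <- nlevels(group) - 1
dfw <- length(scores) - nlevels(group)

msb <- ssb / dfb
msw <- ssw / dfw
f_stat <- msb / msw                                # 26 for this data
p_value <- pf(f_stat, dfb, dfw, lower.tail = FALSE)

# Pairwise comparisons, only meaningful once H0 has been rejected
print(TukeyHSD(fit, conf.level = 0.95))
```

For this made-up data the group means are 23, 30 and 25, so SSB = 130, SSW = 30, and F = 65 / 2.5 = 26.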
- - - - graphical display
        
        stacked/ grouped bar plot
        
        designed for 2 nominal (or ordinal) variables; a bar chart of two variables
        
        histogram
        
        designed for an interval variable, showing counts/ percentages
        
        bar plot
        
        designed for a nominal or ordinal variable; if ordinal: be sure to order the values on the x-axis; a bar chart of one variable
        
        scatter plot
        
        shows the relationship between 2 interval variables (quiz: relationship between Day of the week and Predicted weather?)
        
        box (& whiskers) plot
        
        designed for ordinal/ interval data
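As a rough illustration of the chart types listed above, the sketch below uses R's built-in `mtcars` data set. Which columns play the role of nominal, ordinal, or interval variables is an assumption made for the example only.

```r
# Built-in mtcars data; the variable roles are illustrative assumptions
cyl  <- factor(mtcars$cyl)    # treated as an ordinal variable
gear <- factor(mtcars$gear)   # treated as an ordinal variable

# Bar plot: one nominal/ordinal variable
cyl_counts <- table(cyl)
barplot(cyl_counts, xlab = "Cylinders", ylab = "Count")

# Grouped bar plot: two nominal/ordinal variables
# (beside = FALSE would give a stacked bar plot instead)
two_way <- table(gear, cyl)
barplot(two_way, beside = TRUE, legend.text = TRUE,
        xlab = "Cylinders", ylab = "Count")

# Histogram: one interval variable
hist(mtcars$mpg, xlab = "Miles per gallon", main = "Histogram of mpg")

# Scatter plot: two interval variables
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")

# Box plot: an interval variable split by an ordinal one
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")
```

When run with `Rscript`, the plots go to the default graphics device (usually a PDF file) rather than a window.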
      - numerical summary
        
        percentiles
        
        x is the Pth percentile of the data if the percentage of the data that are less than or equal to x is P. The number P is the percentile rank of x
        
        quartiles
        
        Five-number summary: {xmin,Q1,Q2,Q3,xmax}
        
        interquartile range (IQR) = Q3 - Q1
        
        Z-scores
        
        measure how far a data value lies from the mean, in units of standard deviation: z = (x - mean)/ SD
        
        empirical rule
        
        applies to bell-shaped data distributions
        
        about 68%/ 95%/ 99.7% of the data lie within 1/ 2/ 3 SD of the mean
        
        Chebyshev's rule (holds for any distribution shape)
        
        at least 3/4 (75%) of the data lie within 2 SD of the mean;
        at least 8/9 (88.89%) of the data lie within 3 SD of the mean
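A minimal sketch of the numerical summaries above, using a small made-up vector `x` (an assumption for illustration). Note that `fivenum()` uses Tukey's hinges, so its quartiles can differ slightly from those returned by `quantile()`.

```r
# Hypothetical data
x <- c(2, 4, 4, 5, 7, 8, 9, 10, 12, 15)

# Five-number summary: min, Q1, median, Q3, max
five <- fivenum(x)
print(five)

# Percentiles and interquartile range
print(quantile(x, probs = c(0.25, 0.50, 0.75)))
iqr <- IQR(x)   # Q3 - Q1, based on quantile()
print(iqr)

# Z-scores: distance from the mean in units of standard deviation
z <- (x - mean(x)) / sd(x)
print(round(z, 2))

# Share of values within 2 SD of the mean;
# Chebyshev's rule says this is at least 0.75 for any data set
within_2sd <- mean(abs(z) <= 2)
print(within_2sd)
```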
      - distribution shape
        
        symmetric (mirror images of right and left distribution)
        
        positively (right) skewed
        
        negatively (left) skewed
      - numerical measure
        
        Population Parameter
        
        μ (mean) ; σ (standard deviation) ; p (proportion)
        
        unknown values, but assumed to be fixed
        
        Sample Statistics
        
        x̄ (X bar, mean) ; s (standard deviation) ; p̂ (p hat, proportion)
        
        known values (computed from the sample)
        
        Statistical Characteristic
        
        Center (location)
        
        mean
        
        median
        
        mode
        
        Dispersion (spread)
        
        Range
        
        variance
        
        standard deviation
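The measures of center and dispersion listed above can be computed in R roughly as follows. The vector `x` is made up, and `stat_mode()` is a hypothetical helper written for this example, since base R's `mode()` returns the storage type of an object rather than the most frequent value.

```r
# Hypothetical data
x <- c(3, 5, 5, 6, 8, 9, 13)

# Center
x_mean   <- mean(x)     # 7
x_median <- median(x)   # 6

# Hypothetical helper: most frequent value(s) of a numeric vector
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
x_mode <- stat_mode(x)  # 5

# Dispersion
x_range <- diff(range(x))   # max - min = 10
x_var   <- var(x)           # sample variance (divides by n - 1)
x_sd    <- sd(x)            # square root of the sample variance

print(c(mean = x_mean, median = x_median, mode = x_mode,
        range = x_range, variance = x_var, sd = x_sd))
```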
    - - drawing conclusions (estimates, predictions)
      - say something about the process that generated the data
  - - - measurements in distinct categories, with no given order; categorical
      - CALCULATION: Frequencies, proportion, mode
    - - measurement categories are ordered; the order carries additional information, so identifying higher & lower values now makes sense
      - CALCULATION: median, percentiles, range, interquartile range
      - If the distances between adjacent values are reasonably comparable, the data are often analyzed as interval data
    - - differences between scale values are comparable; quantitative
      - CALCULATION: mean, standard deviation, skewness; common arithmetic calculations are valid
- - - - non-sampling error (bias) - improper wording/ leading language, or a problem in the way the subset is selected
        
        Example: sample 100 current CSOM students to estimate the mean number of hours worked
        non-sampling error: sampling only MS Finance students/ respondents over-reporting their hours
      - sampling error - the sample will not exactly match the process it was drawn from; this is unavoidable. The variability from sample to sample is the sampling error.
- - - - X = the number of successes in n Bernoulli trials; each trial results in only success or failure
      - probability mass function
        dbinom(x, size, prob)   (x = number of successes, size = number of trials, prob = probability of success)
        
        Question: get the probability of a certain number of successes
        e.g. a person makes 70% of his throw attempts; if he shoots 20 throws, what is the probability that he makes exactly 12 of them?
      - cumulative distribution function
        pbinom(q, size, prob)   (gives P(X <= q); use lower.tail = FALSE for P(X > q))
        
        Question: calculate the probability of getting heads more than 3 times if a fair coin is flipped 10 times
        key word: more than/ less than
      - inverse cumulative distribution (quantile) function
        qbinom(p, size, prob)   (p = cumulative probability)
        
        Question: get the xx-th quantile of a binomial distribution with n trials and success probability p
        key word: find cutoff
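The three binomial questions above could be answered in R along these lines. The 90th percentile in the last step is an assumed value chosen for illustration, since the notes leave the quantile unspecified.

```r
# P(X = 12) when n = 20 throws and P(success) = 0.7
p_exactly_12 <- dbinom(12, size = 20, prob = 0.7)
print(p_exactly_12)

# P(X > 3) heads in 10 flips of a fair coin.
# pbinom() returns P(X <= q), so the upper tail is requested explicitly.
p_more_than_3 <- pbinom(3, size = 10, prob = 0.5, lower.tail = FALSE)
print(p_more_than_3)   # 0.828125

# Cutoff: smallest x with P(X <= x) >= 0.90 (0.90 is an assumed example value)
cutoff_90 <- qbinom(0.90, size = 10, prob = 0.5)
print(cutoff_90)       # 7
```

A common slip with "more than" questions is to pass 4 instead of 3: `pbinom(3, ..., lower.tail = FALSE)` gives P(X > 3), which is the same as P(X >= 4).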
    - - X = number of successes in a given interval
    - - mean μ = E(X) = Σ[x P(x)]
      - variance σ² = Σ[(x − μ)² P(x)]
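To make the two formulas concrete, here is a short sketch that applies them to a discrete distribution. The choice of a binomial with n = 3 and p = 0.5 is an assumption for illustration; any vector of values with matching probabilities would work the same way.

```r
# Values and probabilities of a discrete random variable
# (assumed example: binomial with n = 3, p = 0.5)
x_vals <- 0:3
probs  <- dbinom(x_vals, size = 3, prob = 0.5)   # 0.125 0.375 0.375 0.125

# Mean: sum of x * P(x)
mu <- sum(x_vals * probs)                # 1.5

# Variance: sum of (x - mu)^2 * P(x)
sigma2 <- sum((x_vals - mu)^2 * probs)   # 0.75

print(c(mean = mu, variance = sigma2, sd = sqrt(sigma2)))
```

The results agree with the binomial shortcuts np = 1.5 and np(1 - p) = 0.75.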
    - - X = “Success” or “Failure”, coded as X ∈ {0, 1}
      - Standard deviation: sqrt(p(1 - p)) for a single trial; sqrt(np(1 - p)) for the number of successes in n trials
    - - X = the number of trials until the rth success occurs
  - - - Continuous, symmetric, mound-shaped distribution
      - median = mode = mean
      - Standard normal distribution Z ~ N(0,1)
      - probability density function
        dnorm(x, mean, sd)
      - dnorm (density function): (x, mean, sd)
      - pnorm (cumulative distribution function): (q, mean, sd)
      - qnorm (quantile function): (probability, mean, sd)
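A brief sketch of how the three normal-distribution functions relate, using the standard normal Z ~ N(0, 1). The specific inputs (0, 1.96, 0.975) are illustrative choices.

```r
# Density at the center of the standard normal: 1 / sqrt(2 * pi)
density_at_0 <- dnorm(0, mean = 0, sd = 1)
print(density_at_0)

# Cumulative probability P(Z <= 1.96), roughly 0.975
p_below <- pnorm(1.96, mean = 0, sd = 1)
print(p_below)

# Quantile: the z value with 97.5% of the distribution below it, roughly 1.96
z_cut <- qnorm(0.975, mean = 0, sd = 1)
print(z_cut)

# qnorm() undoes pnorm()
print(qnorm(pnorm(1.5)))

# Empirical rule: probability within 1, 2 and 3 SD of the mean
within_k_sd <- pnorm(1:3) - pnorm(-(1:3))
print(round(within_k_sd, 4))   # about 0.68, 0.95, 0.997
```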
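    - - Use the t-distribution instead of the normal distribution when σ is unknown (not because n is small)
      - shape depends on degrees of freedom (df); as n approaches infinity, the t-distribution gets closer to the normal distribution
      - symmetric, centered at 0; mean = mode = median = 0
      - standard deviation > 1 (for df > 2 it equals sqrt(df/ (df - 2))), so the t-distribution is more spread out than the standard normal; a standard deviation below 1 would mean less variability than the standard normal, which does not happen for a t-distribution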
      - t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
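The claims above about the t-distribution can be checked numerically with a sketch like the one below. The degrees of freedom in `df` are arbitrary example values.

```r
# Example degrees of freedom (arbitrary choices)
df <- c(3, 10, 30, 1000)

# 97.5th percentile of the t-distribution vs. the standard normal
t_crit <- qt(0.975, df = df)
z_crit <- qnorm(0.975)
print(round(rbind(df = df, t_crit = t_crit), 3))
print(round(z_crit, 3))

# Standard deviation of the t-distribution for df > 2
t_sd <- sqrt(df / (df - 2))
print(round(t_sd, 3))
```

The critical values shrink toward the normal value as df grows, and the standard deviation stays above 1 while approaching it.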
    - - a statistic that involves a ratio of variances
- - - - Due to sampling error, X bar is unlikely to equal μ exactly
    - - interval = best point estimate (X bar) ± (test statistic, i.e. the critical value) x (estimated variability of the sampling distribution)
        
        test statistic (critical value), determined by:
        
        form of distribution (t-distribution)
        
        acceptable error level (alpha)
        
        estimated variability of the sampling distribution (standard error = s/ sqrt(n))
        
        Calculate confidence interval for μ
        STEP 1. find sample mean/ sample size/ standard deviation/ standard error (s/ sqrt(n))
        STEP 2. calculate t-score = qt(p = alpha/2, df = n - 1, lower.tail = FALSE), or let t.test(data$column_name, conf.level = ???) compute the whole interval
        STEP 3. calculate upper/ lower bound = X bar ± t-score x standard error
        
        Calculate confidence interval for p from the binomial distribution
        
        Wilson score interval method
        prop.test(x, n, conf.level)
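The confidence-interval steps above might look like this in R. The sample `x`, the 95% level, and the counts 42 out of 100 are all made-up values for illustration. `prop.test()` applies a continuity correction by default; `correct = FALSE` is used here so the result is the plain Wilson score interval.

```r
# Hypothetical sample
x <- c(12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9)
alpha <- 0.05   # assumed error level, i.e. a 95% interval

# STEP 1: sample mean, size, standard deviation, standard error
n    <- length(x)
xbar <- mean(x)
s    <- sd(x)
se   <- s / sqrt(n)

# STEP 2: critical t value with n - 1 degrees of freedom
t_crit <- qt(p = alpha / 2, df = n - 1, lower.tail = FALSE)

# STEP 3: lower and upper bounds
ci_manual <- c(xbar - t_crit * se, xbar + t_crit * se)
print(ci_manual)

# Same interval straight from t.test()
ci_ttest <- t.test(x, conf.level = 1 - alpha)$conf.int
print(ci_ttest)

# Interval for a proportion: 42 successes out of 100 (made-up counts)
ci_prop <- prop.test(x = 42, n = 100, conf.level = 0.95, correct = FALSE)$conf.int
print(ci_prop)
```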
    - - we have a (1 - alpha) x 100% degree of belief that we obtained an interval that contains the parameter (μ or p)
        e.g. with alpha = 0.10, we believe that there's a 90% chance that we obtained an interval that contains μ
  - - - step 1: set 2 competing, non-overlapping hypotheses (Ha, H0)
        
        Ha: the claim that we want to establish: μ [>, <, or not equal to] μ0
        
        H0: the competing (null) hypothesis
      - step 2: make assumptions
        
        random sampling
        
        stability over time
        
        if n < 30, assume the process X is normally distributed (X ~ N)
        
        assume that H0 is true
      - step 3: get sample information
        p-value = P(sample result at least as extreme as the one observed | H0 is true)
        if p-value <= alpha, reject H0
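The three steps above can be sketched as a one-sample t-test. The data, the hypothesized mean `mu0 = 50`, the direction of Ha (greater), and `alpha = 0.05` are all assumptions chosen for the example.

```r
# Hypothetical sample and assumed test settings
x     <- c(52, 55, 49, 58, 53, 51, 56, 54, 50, 57)
mu0   <- 50     # value of the mean under H0
alpha <- 0.05   # pre-defined significance level

# Step 1: H0: mu <= 50 vs. Ha: mu > 50
# Steps 2 and 3: assume H0 is true and see how extreme the sample is
res <- t.test(x, mu = mu0, alternative = "greater")
print(res)

# The same test statistic and p-value by hand
t_stat   <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
p_manual <- pt(t_stat, df = length(x) - 1, lower.tail = FALSE)

# Decision rule
reject_h0 <- res$p.value <= alpha
print(reject_h0)
```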
    - - compare "actual but unknown" vs. "decision"
        
        H0 is true, do not reject H0: correct decision
        
        H0 is true, but reject H0: type 1 error (alpha)
        
        Ha is true, but do not reject H0: type 2 error (beta)
        
        Ha is true, reject H0: correct decision
    - - increasing n (sample size) decreases the standard error, which makes the sampling distribution narrower
      - for a fixed n, reducing alpha by moving the cutoff means increasing beta
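Both effects can be illustrated numerically. In the sketch below, the process standard deviation, the sample sizes, and the true shift `delta = 1` are assumed values, and `power.t.test()` from the stats package is used to obtain beta as 1 minus power.

```r
sigma <- 2   # assumed process standard deviation

# Standard error shrinks as n grows (quadrupling n halves it)
n  <- c(10, 40, 160)
se <- sigma / sqrt(n)
print(round(se, 3))

# Type 2 error (beta) for a one-sided, one-sample t-test with n = 30,
# assuming the true mean is 1 unit away from the H0 value
beta_for_alpha <- function(a) {
  1 - power.t.test(n = 30, delta = 1, sd = sigma, sig.level = a,
                   type = "one.sample", alternative = "one.sided")$power
}
beta_05 <- beta_for_alpha(0.05)
beta_01 <- beta_for_alpha(0.01)
print(c(beta_at_alpha_0.05 = beta_05, beta_at_alpha_0.01 = beta_01))
```

With everything else held fixed, the stricter alpha gives the larger beta.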
    - - critical value: the cutoff
      - p-value: the probability of getting a sample result as extreme as or more extreme than the observed result, given H0 is true
    - - hypothesis:
        H0: X ~ N(., .), X is normally distributed
        Ha: X is NOT normally distributed
      - 3-step evaluation
        
        histogram (check distribution shape)
        
        QQ plot (probability plot): qqnorm + qqline
        
        fit test: Shapiro-Wilk test (shapiro.test)
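The three-step normality check could be run in R roughly as follows. The data are simulated with `rnorm()` purely for illustration; with real data, `x` would be the column being checked.

```r
# Simulated data for illustration only
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)

# Step 1: histogram to check the distribution shape
hist(x, main = "Histogram of x")

# Step 2: QQ plot; points close to the line suggest normality
qqnorm(x)
qqline(x)

# Step 3: Shapiro-Wilk test
# H0: x is normally distributed; a small p-value is evidence against H0
sw <- shapiro.test(x)
print(sw)
```

Failing to reject H0 here does not prove normality; it only means the sample shows no strong evidence against it.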