STATISTICS FOR DATA SCIENCE (:four: Outlier (detect (fences = 1.5 x IQR …

- - - - Organize
      - Summarize
      - Simplify
      - Describe & present data
      - example
        
        which study strategies students use
        
        no. of students who experienced stress
    - - Generalise from sample to population
      - Hypothesis testing
      - make predictions
      - example
        
        whether one group of students experiences more stress than another
      - assess relationship between variables
  - - - :one: Sample & Population
      - :two: Sample Statistics
      - :three: Bootstrap
      - :four: Outlier
      - :five: Statistical Model
      - :six: Confounding & other factors
      - :seven: p-values
- - - - no access to population data
      - so, use sample
  - - - is the result from the sample good enough :question:
        
        :+1: 98% CI
        
        if not, how large should the sample be :question:
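
The two questions above (is the sample estimate good enough, and how large should the sample be) can be sketched with a confidence interval for the mean. This is a minimal illustration under stated assumptions: the stress scores are made-up data, the function names are hypothetical, and the interval uses the normal approximation (for small samples a t-based interval would be wider). The `confidence` argument can be set to whatever level the analysis calls for.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    centre = mean(sample)
    margin = z * stdev(sample) / math.sqrt(len(sample))
    return centre - margin, centre + margin

def required_sample_size(sd, margin, confidence=0.95):
    """Smallest n whose interval half-width is at most `margin`, given a guessed sd."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sd / margin) ** 2)

# Made-up stress scores for ten students (illustrative only).
stress_scores = [12, 15, 9, 20, 14, 17, 11, 16, 13, 18]
low, high = mean_confidence_interval(stress_scores)
n_needed = required_sample_size(sd=3.5, margin=1.0)
print(round(low, 2), round(high, 2), n_needed)
```

A narrower interval means a more precise estimate. If the interval is too wide to be useful, `required_sample_size` gives a rough target for n.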
- - - - sample variance
      - sample proportion
      - sample mean
  - - - e.g. population mean
      - H0 & H1 are statements of population parameters
      - assumption
        
        population parameter estimation
        
        the distribution of the variable of interest is adequately described by a distribution with one or more (unknown) parameters
  - - - no. of cases in a sample
    - - collection of sample statistics
        
        from all trials
      - large # of trials
      - What is a sampling distribution?
      - it shows the frequency distribution of a sample statistic, e.g. the sample mean, across all of the random samples (each sample is of size n)
      - mean
        
        sampling distribution of
        sample mean
        
        approaches a normal dist. (bell-shaped) if n is large
        
        closer to TRUTH
        
        The Central Limit Theorem states that the sampling distribution of the sample means will approach a normal distribution as the sample size increases.
        
        CLT holds regardless of the shape of the population distribution (given finite variance)
        
        normal dist. is symmetrical, with mean = median = mode
        
        :+1: better if it's closer to truth
        
        larger n
        
        sampling dist. of sample mean (Khan Academy)
    - - skewness
      - nothing abnormal with other dist. shape
    - - sd of the sampling distribution
        
        :eye: width of sampling dist.
        (hist/density plot)
      - sampling distribution of sample mean with different n sample sizes
        
        A :black_large_square: larger n
        yields a :black_small_square: smaller standard error of the sampling dist.
      - :+1: the smaller the standard error, the more reliable the estimate
        
        larger n
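
The sampling-distribution notes above can be checked with a small simulation. The sketch below is an illustration, not part of the original notes: it assumes an exponential (right-skewed) population with standard deviation 1, an arbitrary seed, and 1000 trials per sample size. It draws many samples of size n, records each sample mean, and measures the spread of those means, which is the standard error.

```python
import random
from statistics import mean, stdev

random.seed(0)  # arbitrary seed so the run is repeatable

def sample_means(n, trials=1000):
    """Means of `trials` random samples of size n from a skewed population."""
    return [sum(random.expovariate(1.0) for _ in range(n)) / n
            for _ in range(trials)]

standard_errors = {}
for n in (5, 50, 500):
    means = sample_means(n)
    standard_errors[n] = stdev(means)  # sd of the sampling distribution
    # Theory for this population: standard error is about 1 / sqrt(n).
    print(n, round(mean(means), 3), round(standard_errors[n], 3))
```

Even though the population is skewed, the means cluster around the population mean of 1, and the spread shrinks as n grows. A histogram of `means` for n = 500 would look close to bell-shaped.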
- - - - a large # of samples, each of the same size n as the original sample, are repeatedly drawn with replacement from the original sample
        
        instead of drawing random samples of size n from the population, i.e. normal sampling
  - - - repeated sampling yields data that approximate the true population data
        
        sample mean, variance & SD converge to the respective parameter estimates
      - describe the results of performing the same experiment a large # times
  - - - think of the sampling from population but allow duplicates
    - - but each bootstrap sample is SAME SIZE
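
The bootstrap procedure described above can be sketched in a few lines. This is a minimal illustration under assumptions: the ten stress scores are made-up data, the number of resamples (5000) and the seed are arbitrary, and the interval shown is the simple percentile interval, which is only one of several bootstrap interval methods.

```python
import random
from statistics import mean, stdev

random.seed(1)  # arbitrary seed so the run is repeatable

# Made-up original sample (illustrative only).
original = [12, 15, 9, 20, 14, 17, 11, 16, 13, 18]

def bootstrap_means(sample, n_resamples=5000):
    """Resample WITH replacement, each resample the same size as the original."""
    n = len(sample)
    return [mean(random.choices(sample, k=n)) for _ in range(n_resamples)]

boot = sorted(bootstrap_means(original))
boot_se = stdev(boot)                      # bootstrap estimate of the standard error
ci_low = boot[int(0.025 * len(boot))]      # approximate 2.5th percentile
ci_high = boot[int(0.975 * len(boot)) - 1] # approximate 97.5th percentile
print(round(boot_se, 3), ci_low, ci_high)
```

`random.choices` samples with replacement, so duplicates are allowed, and `k=n` keeps every bootstrap sample the same size as the original.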
- - - - remove
        
        never drop an outlier without a clear rationale
        
        dropped outliers need to be reported
      - change or bring value into range
        
        fix error or irregular outliers
  - - - large outlier: data point > upper fence
      - small outlier: data point < lower fence
      - IQR: Inter Quartile Range,
        quartile 3 - quartile 1
        Q3 - Q1
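
The fence rule (fences at 1.5 x IQR beyond the quartiles) can be sketched as follows. The data values and the function name are made up for illustration. Quartile conventions differ between tools, so the `method="inclusive"` choice here is an assumption, and other software may give slightly different fences.

```python
from statistics import quantiles

def iqr_fences(data, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = [2, 4, 5, 5, 6, 7, 8, 9, 30]  # made-up data with one extreme value
lower, upper = iqr_fences(values)
small_outliers = [x for x in values if x < lower]  # below the lower fence
large_outliers = [x for x in values if x > upper]  # above the upper fence
print(lower, upper, small_outliers, large_outliers)
```

Flagging a point this way only marks it for review. Whether to drop, correct, or keep it still needs the rationale and reporting described above.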
- - - - some variables without specific value
        
        variable with probability distributions
        
        stochastic variables i.e. randomly determined
- - - - hidden effect on dependent variable
        
        e.g. can suggest a correlation when in fact there isn't one
    - - correlate directly or inversely with both dependent & independent variable
      - the estimate fails to take the confounding factor into account
  - - - random distribution of confounders between study groups
    - - equal distribution of confounders, of individuals or groups
    - - confounders are distributed evenly within each stratum
      - i.e. grouping: divide the variable into groups (see the example in lecture notes, slide 46)
    - - usually distorted by choice of standard
    - - only works if confounders identified and measured
      - e.g. multiple regression
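
A small simulation can show how a confounder produces an apparent relationship, and how stratifying on it removes the distortion. Everything here is hypothetical: `z` is a made-up binary confounder that shifts both `x` and `y`, while `x` has no effect on `y` at all. The sketch uses stratification rather than multiple regression to keep it dependency-free.

```python
import random

random.seed(2)  # arbitrary seed so the run is repeatable

def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rows = []
for _ in range(4000):
    z = random.random() < 0.5                 # hypothetical confounder
    x = random.gauss(3.0 if z else 0.0, 1.0)  # x depends on z
    y = random.gauss(2.0 if z else 0.0, 1.0)  # y depends on z only, not on x
    rows.append((z, x, y))

# Pooled estimate ignores the confounder and shows a spurious positive slope.
pooled = slope([x for _, x, _ in rows], [y for _, _, y in rows])

# Stratified estimates: within each level of z the slope is close to zero.
within = {
    flag: slope([x for z, x, _ in rows if z == flag],
                [y for z, _, y in rows if z == flag])
    for flag in (False, True)
}
print(round(pooled, 3), {k: round(v, 3) for k, v in within.items()})
```

As the notes say, this only works because the confounder was identified and measured. An unmeasured `z` would leave only the misleading pooled slope.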
- - - - Confidence level is 95%
      - Type 1 error rate is 0.05
        
        the error of rejecting a true null hypothesis, H0
        
        false positive finding
        
        I am what is default, the status quo. I'm already accepted, can only be rejected. The burden of proof is on the alternative. I am H0
      - finding is significant at 0.05 level
      - the alpha level is 0.05
      - the P-value threshold is 0.05
  - - - :+1::skin-tone-5: the smaller the P-value
        
        the stronger the evidence against H0
        
        (a smaller alpha, not a smaller P-value, makes the test more stringent)
        
        indicates that a result at least this extreme would be unlikely if H0 were true
        result is statistically significant if the P-value <= alpha
      - Result is statistically significant if p-value <= significance level
    - - helps draw a conclusion about the hypothesis
        
        P > 0.05
        
        weak evidence against H0
        
        can't reject H0
        
        P <= 0.05
        
        strong evidence against H0
        
        can reject H0
        
        P close to 0.05
        
        marginal value
        
        possible either way
      - to answer: how much confidence can we place in the outcome of the hypothesis test :question:
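    - - p-values can indicate how incompatible the data are with a specified statistical model
      - p-values do not measure the probability that the hypothesis is true, or
        the probability that the data were produced by random chance alone
      - scientific conclusions and business or policy decisions should not be based solely on a p-value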
      - proper inference requires full reporting & transparency
      - American Statistical Association
      - p-value/statistical significance does not measure the size of an effect or the importance of a result
      - by itself, a p-value is not a good measure of evidence regarding a model or hypothesis
  - - - calculate P(type I or II error)
        
        Type I error
        
        falsely concluding that the intervention is successful
        
        Reject the true H0
        
        false positive findings
        
        wrongly classifying a non-event as an event
        
        Type II error
        
        falsely concluding that intervention wasn't successful
        
        failing to reject a false H0
        
        false negative result
        
        wrongly classifying an event as a non-event
    - - randomized controlled experiments with two variants A & B
        
        A control
        
        B Variation
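
An A/B comparison with a p-value decision can be sketched with a permutation test. This is one possible test among several, chosen here because it needs no external library. The conversion counts are made up, the function name is hypothetical, and alpha = 0.05 follows the significance level used in the notes above.

```python
import random

random.seed(3)  # arbitrary seed so the run is repeatable

def permutation_p_value(a, b, n_permutations=5000):
    """Two-sided permutation test for a difference in group means."""
    observed = abs(sum(b) / len(b) - sum(a) / len(a))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        random.shuffle(pooled)  # under H0 the group labels are exchangeable
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(new_b) / len(new_b) - sum(new_a) / len(new_a))
        if diff >= observed - 1e-12:  # small tolerance for float ties
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)

# Made-up outcomes: 1 = converted, 0 = did not convert.
control = [1] * 20 + [0] * 180    # A: control
variation = [1] * 40 + [0] * 160  # B: variation

alpha = 0.05
p_value = permutation_p_value(control, variation)
reject_h0 = p_value <= alpha
print(round(p_value, 4), reject_h0)
```

Rejecting H0 here means the observed difference would be unusual if A and B truly performed the same. It could still be a Type I error, and a non-rejection could still be a Type II error, so the effect size and study design matter alongside the p-value.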